diversity and depth.


N. Duane Melson 

NASA Langley Research Center 

Steve F. McCormick and 
Tom A. Manteuffel 
University of Colorado at Denver 


The use of trademarks or manufacturer's names in this publication does not 
constitute endorsement, either expressed or implied, by the National Aeronautics and 
Space Administration. 




ORGANIZING COMMITTEE 


Joel Dendy 

Los Alamos National Laboratory 

Craig Douglas 

IBM and Yale University 

Paul Frederickson 
RIACS 

Van Henson 

Naval Postgraduate School 

Jan Mandel 

The University of Colorado at Denver 


Duane Melson 

NASA Langley Research Center 

Seymour Parter 

University of Wisconsin - Madison 

Joseph Pasciak 
Brookhaven National Lab 

Boris Rozovski 

University of Southern California 

John Ruge 

University of Colorado at Denver 

Klaus Stueben 

Gesellschaft f. Math. u. Datenverarbeitung 

James Thomas 

NASA Langley Research Center 

Pieter Wesseling 
Delft University 

Olof Widlund 

Courant Institute 


CONTENTS


Preface

Organizing Committee

Part 1 


A Multigrid Solver for the Semiconductor Equations

Bernhard Bachmann and Asea Brown Boveri


FAS Multigrid Calculations of Three Dimensional Flow Using Non-Staggered Grids

D. Matovic, A. Pollard, H. A. Becker, and E. W. Grandmaison 


Multigrid and Cyclic Reduction Applied to the Helmholtz Equation 

Kenneth Brackenridge 

Uniform Convergence of Multigrid V-Cycle Iterations for Indefinite and Nonsymmetric 
Problems 

James H. Bramble, Do Y. Kwak, and Joseph E. Pasciak 

A Multilevel Cost-Space Approach to Solving the Balance Long Transportation 
Problem 

Kevin J. Cavanaugh and Van Emden Henson 

Vectorization and Parallelization of the Finite Strip Method for Dynamic Mindlin 
Plate Problems 

Hsin-Chu Chen and Ai-Fang He 

Domain Decomposition Methods for Nonconforming Finite Element Spaces of 
LaGrange-Type 

Lawrence C. Cowsar 

Nested Krylov Methods and Preserving the Orthogonality 

Eric De Sturler and Diederik R. Fokkema 


Implementing Abstract Multigrid or Multilevel Methods

Craig C. Douglas

Numerical Solution of Flame Sheet Problems with and without Multigrid Methods

Craig C. Douglas and Alexandre Ern

A Mixed Method Poisson Solver for Three-Dimensional Self-Gravitating Astrophysical Fluid Dynamical Systems

Comer Duncan and Jim Jones


Multigrid Methods for Differential Equations with Highly Oscillatory Coefficients

Bjorn Engquist and Erding Luo

Application of Multigrid Methods to the Solution of Liquid Crystal Equations on a SIMD Computer

Paul A. Farrell, Arden Ruttan, and Reinhardt R. Zeller


An Adaptive Multigrid Model for Hurricane Track Prediction 

Scott R. Fulton 

Relaxation Schemes for Chebyshev Spectral Multigrid Methods 

Yimin Kang and Scott R. Fulton 

Multigrid Methods for a Semilinear PDE in the Theory of Pseudoplastic Fluids 
Van Emden Henson and A. W. Shaker 

A Multilevel Adaptive Projection Method for Unsteady Incompressible Flow
Louis H. Howell


Wavelet Multiresolution Analyses Adapted for the Fast Solution of Boundary Value Ordinary Differential Equations

Björn Jawerth and Wim Sweldens

Comparison of Locally Adaptive Multigrid Methods: L.D.C., F.A.C. and F.I.C.

Khodor Khadra, Philippe Angot, and Jean-Paul Caltagirone

Multi-Grid Domain Decomposition Approach for Solution of Navier-Stokes Equations in Primitive Variable Form

Hwar-Ching Ku and Bala Ramaswamy

Compressible Turbulent Flow Simulation with a Multigrid Multiblock Method

Hans Kuerten and Bernard Geurts

A Nonconforming Multigrid Method Using Conforming Subspaces

Chang Ock Lee

Multigrid Method for Integral Equations and Automatic Programs

Hosae Lee


An Object-Oriented Approach for Parallel Self Adaptive Mesh Refinement 
on Block Structured Grids 

Max Lemke, Kristian Witsch, and Daniel Quinlan 



Part 2* 


Optimal Resolution in Maximum Entropy Image Reconstruction from Projections 
with Multigrid Acceleration 

Mark A. Limber, Tom A. Manteuffel, Stephen F. McCormick, and David S. Sholl 

Flow Transition with 2-D Roughness Elements in a 3-D Channel 

Zhining Liu, Chaoqun Liu, and Stephen F. McCormick

Multilevel Methods for Transport Equations in Diffusive Regimes

Thomas A. Manteuffel and Klaus Ressel

Analysis of Multigrid Methods on Massively Parallel Computers: Architectural Implications

Lesley R. Matheson and Robert E. Tarjan


*Part 2 is presented under separate cover.


Time-Accurate Navier-Stokes Calculations with Multigrid Acceleration

N. Duane Melson, Mark S. Sanetrik, and Harold L. Atkins

MGGHAT: Elliptic PDE Software with Adaptive Refinement, Multigrid and High Order Finite Elements

William F. Mitchell

Looking for O(N) Navier-Stokes Solutions on Non-Structured Meshes

Eric Morano and Alain Dervieux

The Block Adaptive Multigrid Method Applied to the Solution of the Euler Equations

Nikos Pantelelis

Multigrid Schemes for Viscous Hypersonic Flows

R. Radespiel and R. C. Swanson

Layout Optimization with Algebraic Multigrid Methods

Hans Regler and Ulrich Rüde

Optimal Convolution SOR Acceleration of Waveform Relaxation with Application to Semiconductor Device Simulation

Mark Reichelt

A Multigrid Method for Steady Euler Equations on Unstructured Adaptive Grids
Kris Riemslagh and Erik Dick

Two-Level Schwarz Methods for Nonconforming Finite Elements and Discontinuous Coefficients

Marcus Sarkis

An Automatic Multigrid Method for the Solution of Sparse Linear Systems

Yair Shapira, Moshe Israeli, and Avram Sidi

A Multigrid Algorithm for the Cell-Centered Finite Difference Scheme

Richard E. Ewing and Jian Shen

A Semi-Lagrangian Approach to the Shallow Water Equations

J. R. Bates, Stephen F. McCormick, John Ruge, David S. Sholl, and
Irad Yavneh

Multigrid Solution of the Navier-Stokes Equations on Highly Stretched Grids with Defect Correction

Peter M. Sockol

The Multigrid Preconditioned Conjugate Gradient Method

Osamu Tatebe

Mapping Robust Parallel Multigrid Algorithms to Scalable Memory Architectures

Andrea Overman and John Van Rosendale

Unstructured Multigrid Through Agglomeration

V. Venkatakrishnan, D. J. Mavriplis, and M. J. Berger


Multigrid Properties of Upwind-Biased Data Reconstruction

Gary P. Warren and Thomas W. Roberts

On the Prediction of Multigrid Efficiency Through Local Mode Analysis

R. V. Wilson

Numerical Study of a Multigrid Method with Four Smoothing Methods for the
Incompressible Navier-Stokes Equations in General Coordinates

S. Zeng and P. Wesseling




Part 1


A MULTIGRID SOLVER FOR THE SEMICONDUCTOR EQUATIONS 


Bernhard Bachmann 

Institut für Angewandte Mathematik der Universität Zürich,

Rämistr. 74, 8001 Zürich, Switzerland
and

Asea Brown Boveri, Corporate Research,

5405 Baden-Dättwil, Switzerland.

SUMMARY 

We present a multigrid solver for the exponential fitting method, applied to the current continuity equations of semiconductor device simulation in two dimensions. The exponential fitting method is based on a mixed finite element discretization using the lowest-order Raviart-Thomas triangular element. This discretization method yields a good approximation of front layers and guarantees current conservation. The corresponding stiffness matrix is an M-matrix. "Standard" multigrid solvers, however, cannot be applied to the resulting system, as it is dominated by an unsymmetric part, which is due to the presence of strong convection in part of the domain.

To overcome this difficulty, we explore the connection between Raviart-Thomas mixed methods and the nonconforming Crouzeix-Raviart finite element discretization. In this way we can construct nonstandard prolongation and restriction operators using easily computable weighted $L^2$-projections based on suitable quadrature rules and the upwind effects of the discretization. The resulting multigrid algorithm shows very good results, even for real-world problems and for locally refined grids.

1. INTRODUCTION 

The exponential fitting method applied to the current continuity equations is based on a mixed finite element discretization using the lowest-order Raviart-Thomas triangular element [1]. This discretization yields a good approximation of front layers and guarantees current conservation. The corresponding scheme results in a large sparse system of equations, which is dominated by an unsymmetric part. When applying multigrid algorithms to the resulting system (7), the most difficult part is the construction of suitable prolongation and restriction operators. Using the connection between Raviart-Thomas mixed methods and the nonconforming Crouzeix-Raviart finite element discretization, we overcome this difficulty.

In § 2 we give some results from [2] concerning the mixed finite element discretization. We determine the resulting system and show its relation to a nonconforming finite element method. § 3 deals with the solution of the system of linear equations by our multigrid solver. First we construct easily computable $L^2$-projections, based on suitable quadrature rules and the upwind effect of the discretization. Due to the presence of strong convection in part of the domain, it is also necessary to consider special smoothers for the multigrid algorithm. We use a minimal residual method with ILU preconditioning. The results of the numerical tests are given in § 4.
2. THE EXPONENTIAL FITTING METHOD FOR CURRENT CONTINUITY EQUATIONS


2.1. Mixed Finite Element Method 

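Let $\Omega \subset \mathbb{R}^2$ be a connected, bounded and polygonal domain. $H^m(\Omega)$, for $m \in \mathbb{N}$, and $L^2(\Omega) := H^0(\Omega)$ denote the usual Sobolev and Lebesgue spaces equipped with the norm

$$\|u\|_m := \Big\{\sum_{|\alpha| \le m} \int_\Omega |D^\alpha u|^2\,dx\Big\}^{1/2}.$$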

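For $f \in L^2(\Omega)$ and $g \in L^2(\Gamma_0)$, $\Gamma_0 \subset \partial\Omega$ closed with positive length, we consider the current continuity equation, as given in [3]:

Find $u \in H^1(\Omega)$ such that

$$\begin{cases} \operatorname{div}(\operatorname{grad} u + u\,\operatorname{grad}\psi) = f & \text{in } \Omega \subset \mathbb{R}^2,\\ u = g & \text{on } \Gamma_0 \subset \partial\Omega,\\ \dfrac{\partial u}{\partial n} + u\,\dfrac{\partial \psi}{\partial n} = 0 & \text{on } \Gamma_1 := \partial\Omega \setminus \Gamma_0. \end{cases} \qquad (1)$$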

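The current is defined by $J = \operatorname{grad} u + u\,\operatorname{grad}\psi$. Here, $\psi \in H^1(\Omega)$ is a given bounded function. To discretize problem (1) we introduce the classical change of variables from $u$ to the so-called Slotboom variable $\rho$ [3],

$$\rho := e^{\psi} u, \qquad \text{so that } J = e^{-\psi}\operatorname{grad}\rho.$$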


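This results in the following symmetric form of problem (1):

Find $\rho \in H^1(\Omega)$ such that

$$\begin{cases} \operatorname{div}(e^{-\psi}\operatorname{grad}\rho) = f & \text{in } \Omega \subset \mathbb{R}^2,\\ \rho = \chi := e^{\psi} g & \text{on } \Gamma_0 \subset \partial\Omega,\\ \dfrac{\partial \rho}{\partial n} = 0 & \text{on } \Gamma_1 = \partial\Omega \setminus \Gamma_0. \end{cases} \qquad (2)$$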


Let $\{\mathcal{T}_k\}_{k \ge 0}$ be a regular sequence of decompositions of $\Omega$ into triangles. Denote by $h_k$ the longest side of all triangles $T \in \mathcal{T}_k$. The set of edges of $\mathcal{T}_k$ is denoted by $\mathcal{E}_k$, where $\mathcal{E}_k^{\partial}$ are the boundary edges and $\mathcal{E}_k^{0} = \mathcal{E}_k \setminus \mathcal{E}_k^{\partial}$ are all interelement boundaries. Denote by $m_e$ the midpoint of an edge $e$ of $\mathcal{E}_k$. Moreover, let $P_m$, $m \ge 0$, be the space of all polynomials of degree not greater than $m$. Following [1], we use the lowest-order Raviart-Thomas mixed finite element to discretize (2). Therefore we define the following set of polynomial vectors


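$$RT(T) := \{\tau = (\tau_1, \tau_2) : \tau_1 = \alpha + \beta x,\ \tau_2 = \gamma + \beta y,\ \alpha, \beta, \gamma \in \mathbb{R}\}, \quad \forall T \in \mathcal{T}_k,$$

and set

$$V_k := \{\tau \in (L^2(\Omega))^2 : \operatorname{div}\tau \in L^2(\Omega),\ \tau \cdot n = 0 \text{ on } \Gamma_1,\ \tau|_T \in RT(T)\ \forall T \in \mathcal{T}_k\},$$

$$W_k := \{\varphi \in L^2(\Omega) : \varphi|_T \in P_0(T)\ \forall T \in \mathcal{T}_k\}.$$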
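The two defining properties of $RT(T)$ can be checked numerically. The following Python sketch is only an illustration under simple assumptions (a single straight edge, hypothetical helper names `rt0_field`, `normal_components` and `divergence`): it samples $\tau = (\alpha + \beta x, \gamma + \beta y)$ along an edge and confirms that $\tau \cdot n$ is constant there and that $\operatorname{div}\tau = 2\beta$, which is why one normal flux per edge suffices as a degree of freedom.

```python
def rt0_field(alpha, beta, gamma):
    """Return tau(x, y) = (alpha + beta*x, gamma + beta*y), an element of RT(T)."""
    return lambda x, y: (alpha + beta * x, gamma + beta * y)


def normal_components(tau, p, q, samples=5):
    """Sample tau . n at equally spaced points of the straight edge from p to q."""
    (x0, y0), (x1, y1) = p, q
    nx, ny = y1 - y0, -(x1 - x0)          # a normal vector of the edge
    length = (nx * nx + ny * ny) ** 0.5
    nx, ny = nx / length, ny / length     # unit normal
    values = []
    for i in range(samples):
        t = i / (samples - 1)
        tx, ty = tau(x0 + t * (x1 - x0), y0 + t * (y1 - y0))
        values.append(tx * nx + ty * ny)
    return values


def divergence(tau, x, y, h=1e-6):
    """Central-difference approximation of div tau at (x, y)."""
    dtx = tau(x + h, y)[0] - tau(x - h, y)[0]
    dty = tau(x, y + h)[1] - tau(x, y - h)[1]
    return (dtx + dty) / (2 * h)


tau = rt0_field(alpha=1.0, beta=2.0, gamma=-0.5)
flux = normal_components(tau, (0.0, 0.0), (1.0, 0.3))
print(max(flux) - min(flux))       # about 0: constant normal component on the edge
print(divergence(tau, 0.3, 0.2))   # about 4.0 = 2 * beta
```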
Then the mixed finite element discretization of (2) is defined as:


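Find $(J_k, \rho_k) \in V_k \times W_k$ such that for all $(\tau_k, \varphi_k) \in V_k \times W_k$

$$\begin{aligned} \int_\Omega e^{\psi} J_k \cdot \tau_k\,dx + \int_\Omega \rho_k \operatorname{div}\tau_k\,dx &= \int_{\Gamma_0} \chi\,\tau_k \cdot n\,ds,\\ \int_\Omega \varphi_k \operatorname{div} J_k\,dx &= \int_\Omega f \varphi_k\,dx. \end{aligned} \qquad (3)$$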


The matrix associated with (3) is not coercive. To avoid this inconvenience we introduce a Lagrange multiplier. We define


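$$\tilde V_k := \{\tau \in (L^2(\Omega))^2 : \tau|_T \in RT(T)\ \forall T \in \mathcal{T}_k\},$$

and for $\xi \in L^2(\Gamma_0)$

$$\Lambda_{k,\xi} := \Big\{\mu \in L^2(\mathcal{E}_k) : \mu|_e \in P_0(e)\ \forall e \in \mathcal{E}_k,\ \int_e (\mu - \xi)\,ds = 0\ \forall e \subset \Gamma_0\Big\}.$$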


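Instead of (3) we now consider the mixed equilibrium discretization:

Find $(\tilde J_k, \tilde\rho_k, \lambda_k) \in \tilde V_k \times W_k \times \Lambda_{k,\chi}$ such that for all $(\tau_k, \varphi_k, \mu_k) \in \tilde V_k \times W_k \times \Lambda_{k,0}$

$$\begin{aligned} \sum_{T \in \mathcal{T}_k} \int_T e^{\psi} \tilde J_k \cdot \tau_k\,dx + \sum_{T \in \mathcal{T}_k} \int_T \tilde\rho_k \operatorname{div}\tau_k\,dx - \sum_{T \in \mathcal{T}_k} \int_{\partial T} \lambda_k\,\tau_k \cdot n\,ds &= 0,\\ \sum_{T \in \mathcal{T}_k} \int_T \varphi_k \operatorname{div}\tilde J_k\,dx &= \int_\Omega f \varphi_k\,dx,\\ \sum_{T \in \mathcal{T}_k} \int_{\partial T} \mu_k\,\tilde J_k \cdot n\,ds &= 0. \end{aligned} \qquad (4)$$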


As shown in [3], problem (4) has a unique solution, and $\tilde J_k = J_k$, $\tilde\rho_k = \rho_k$ holds. Moreover, $\lambda_k$ is a good approximation of the solution of (2) at the interelement boundaries [2]. It is possible to eliminate the unknowns corresponding to $\tilde J_k$ and $\tilde\rho_k$ from the resulting system by static condensation [3]. This yields a matrix (acting only on the interelement multiplier $\lambda_k$) which is symmetric positive definite, and which is an M-matrix if the triangulation is of the weakly acute type (i.e. no angle $> \pi/2$).



2.2. The Nonconforming Finite Element Formulation 


To introduce the nonconforming finite element formulation we need the following definitions: 
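Let $\Pi^0_k$ be the $L^2$-projection from $L^2(\mathcal{E}_k)$ onto

$$\Lambda_k := \{\mu_k \in L^2(\mathcal{E}_k) : \mu_k|_e \in P_0(e)\ \forall e \in \mathcal{E}_k\}$$

and $P^0_k$ be the $L^2$-projection from $L^2(\Omega)$ onto

$$S^0_k := \{v_k \in L^2(\Omega) : v_k|_T \in P_0(T)\ \forall T \in \mathcal{T}_k\},$$

i.e. $\Pi^0_k(\xi)|_e = \frac{1}{|e|}\int_e \xi\,ds$ for all $e \in \mathcal{E}_k$ and $P^0_k(v)|_T = \frac{1}{|T|}\int_T v\,dx$ for all $T \in \mathcal{T}_k$.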

The Crouzeix-Raviart finite element space [4] is defined by

$$S_k := \{v_k \in L^2(\Omega) : v_k|_T \in P_1(T)\ \forall T \in \mathcal{T}_k,\ v_k \text{ is continuous at the midpoints of edges}\}.$$

For $\xi \in L^2(\Gamma_0)$ we define

$$S_{k,\xi} := \{v_k \in S_k : v_k(m_e) = \Pi^0_k(\xi)|_e\ \forall e \subset \Gamma_0\}.$$

Notice that the standard basis functions of $S_k$ are equal to one at the midpoint of exactly one edge and vanish at the midpoints of all other edges. Using the arguments concerning static condensation in [5], it is straightforward to prove the following lemma.
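A small Python sketch of the local Crouzeix-Raviart basis confirms the nodal property stated above. It assumes the common representation $\varphi_i = 1 - 2\lambda_i$, where $\lambda_i$ is the barycentric coordinate of the vertex opposite the edge; the helper names are hypothetical.

```python
def barycentric(v, x, y):
    """Barycentric coordinates (lambda_0, lambda_1, lambda_2) of (x, y) in triangle v."""
    (x0, y0), (x1, y1), (x2, y2) = v
    det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)   # twice the signed area
    l1 = ((x - x0) * (y2 - y0) - (x2 - x0) * (y - y0)) / det
    l2 = ((x1 - x0) * (y - y0) - (x - x0) * (y1 - y0)) / det
    return (1.0 - l1 - l2, l1, l2)


def edge_midpoints(v):
    """m[i] is the midpoint of the edge opposite vertex i."""
    return [((v[(i + 1) % 3][0] + v[(i + 2) % 3][0]) / 2,
             (v[(i + 1) % 3][1] + v[(i + 2) % 3][1]) / 2) for i in range(3)]


def cr_basis(v, i, x, y):
    """Local Crouzeix-Raviart basis function of the edge opposite vertex i."""
    return 1.0 - 2.0 * barycentric(v, x, y)[i]


tri = [(0.0, 0.0), (2.0, 0.5), (0.3, 1.7)]
mids = edge_midpoints(tri)
table = [[cr_basis(tri, i, *mids[j]) for j in range(3)] for i in range(3)]
print(table)   # approximately the identity matrix: phi_i(m_j) = delta_ij
```

The same representation gives $\operatorname{grad}\varphi_i = -2\operatorname{grad}\lambda_i$, so the element stiffness matrix of $S_k$ is four times the familiar linear-element one.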

Lemma 2.1.

The solution $\lambda_k$ of (4) can be written as $\lambda_k = \Pi^0_k(w_k)$, where $w_k$ is the solution of the following nonconforming weak problem:

Find $w_k \in S_{k,\chi}$ such that for all $v_k \in S_{k,0}$

/ /* 3 1 

grad w k grad v k dx = Pk(f) / (o “ 2 P°fe^ )^ Vkdx ' ( 5 ) 

- Ter k JT kK ' 

□

Remark 2.2.

For $w_k$ as in Lemma 2.1 and the solution $\rho$ of (2), the following error estimate [2] holds:

$$\|\rho - w_k\|_0 \le \gamma\, h_k^2\, (\|\rho\|_3 + \|J\|_2)$$

with $\gamma = \gamma(e^{\psi})$ independent of $\rho$ and $h_k$.

□
The Lagrange multiplier $\lambda_k$ is an approximation of $\rho = e^{\psi} u$. In semiconductor simulations the range of $\psi$ is very large, so that $\lambda_k$ is not suited for actual computations. Moreover, we are interested in approximating the solution $u$ of (1). Hence we introduce the following change of variable:

$$\nu_k := \left(\Pi^0_k(e^{\psi})\right)^{-1} \lambda_k \in \Lambda_{k,(\Pi^0_k(e^{\psi}))^{-1}\chi}. \qquad (6)$$

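Denote the standard basis of $S_k$ by $\varphi_e$, $e \in \mathcal{E}_k$. We define the linear operator $E_k : S_k \to S_k$ by

$$E_k(\varphi_e) := \Pi^0_k(e^{\psi})|_e\,\varphi_e \qquad \forall e \in \mathcal{E}_k.$$

For $f \in L^2(\Omega)$, $G_k(f) \in L^2(\Omega)$ is defined by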

cm = nu)(\ - 


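Finally we arrive at the following statement:

Lemma 2.3.

Let $\zeta = (\Pi^0_k(e^{\psi}))^{-1}\chi \in L^2(\Gamma_0)$. Then $\nu_k$ of (6) can be written as $\nu_k = \Pi^0_k(u_k)$, where $u_k$ is the solution of the nonconforming weak problem:

Find $u_k \in S_{k,\zeta}$ such that for all $v_k \in S_{k,0}$

$$\sum_{T \in \mathcal{T}_k} \left(P^0_k(e^{\psi})|_T\right)^{-1} \int_T \operatorname{grad} E_k(u_k) \cdot \operatorname{grad} v_k\,dx = \int_\Omega G_k(f)\,v_k\,dx. \qquad (7)$$

□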


Remark 2.4. 

Note that problem (7) is the usual nonconforming Crouzeix-Raviart discretization of the Laplace equation if $\psi$ and $f$ are constant on $\Omega$.


We can use the error estimate of Remark 2.2 to obtain an estimate for the approximation error between the solution $u_k$ of (7) and the solution $u$ of (1), though the result is rather unsatisfying. To arrive at an improved error bound, one could use the fact that two Babuška-Brezzi conditions hold [6] for the corresponding bilinear form; the stability and the unique solvability of the discrete problem (7) also follow from these conditions. In the following we construct a multigrid algorithm for problem (7). To this end we define the bilinear form $a_k$ on $S_k$ by

$$a_k(u_k, v_k) := \sum_{T \in \mathcal{T}_k} \left(P^0_k(e^{\psi})|_T\right)^{-1} \int_T \operatorname{grad} E_k(u_k) \cdot \operatorname{grad} v_k\,dx.$$
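The effect of $E_k$ at the element level can be illustrated with a small Python sketch. It assumes (hypothetically) that the element average $P^0_k(e^{\psi})|_T$ and the edge averages $\Pi^0_k(e^{\psi})|_e$ have already been computed and are passed in as plain numbers. Since $E_k$ multiplies each basis function $\varphi_e$ by $\Pi^0_k(e^{\psi})|_e$, the element matrix of $a_k$ is the Crouzeix-Raviart element stiffness matrix with column $j$ scaled by the factor of edge $j$ and the whole matrix scaled by $(P^0_k(e^{\psi})|_T)^{-1}$. Different edge factors therefore make the matrix unsymmetric, while the positive scalings leave the signs of all entries unchanged; the obtuse triangle at the end shows how a positive off-diagonal entry arises when the weakly acute condition fails.

```python
def barycentric_gradients(v):
    """Gradients of lambda_0, lambda_1, lambda_2 on triangle v, and its area."""
    (x0, y0), (x1, y1), (x2, y2) = v
    det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)   # twice the signed area
    grads = [((y1 - y2) / det, (x2 - x1) / det),
             ((y2 - y0) / det, (x0 - x2) / det),
             ((y0 - y1) / det, (x1 - x0) / det)]
    return grads, abs(det) / 2.0


def cr_stiffness(v):
    """K[i][j] = int_T grad phi_i . grad phi_j for the CR basis phi_i = 1 - 2*lambda_i."""
    grads, area = barycentric_gradients(v)
    return [[4.0 * area * (grads[i][0] * grads[j][0] + grads[i][1] * grads[j][1])
             for j in range(3)] for i in range(3)]


def a_k_element_matrix(v, p0_exp_psi, pi0_exp_psi_edges):
    """Local matrix A[i][j] = a_k(phi_j, phi_i) on one triangle (hypothetical helper).

    p0_exp_psi: element average P0_k(e^psi)|_T
    pi0_exp_psi_edges: edge averages Pi0_k(e^psi)|_e, edge j opposite vertex j
    """
    K = cr_stiffness(v)
    return [[pi0_exp_psi_edges[j] * K[i][j] / p0_exp_psi for j in range(3)]
            for i in range(3)]


right = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]          # weakly acute (right angle)
A_const = a_k_element_matrix(right, 1.0, [1.0, 1.0, 1.0])
A_var = a_k_element_matrix(right, 5.0, [1.0, 10.0, 100.0])
print(A_const)                    # [[4.0, -2.0, -2.0], [-2.0, 2.0, 0.0], [-2.0, 0.0, 2.0]]
print(A_var[0][1], A_var[1][0])   # -4.0 -0.4 : unsymmetric, same signs

obtuse = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.2)]         # obtuse angle at vertex 2
print(cr_stiffness(obtuse)[0][1] > 0)                 # True: positive off-diagonal entry
```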
3. MULTIGRID METHOD


3.1. Adaptive Mesh-Refinement Techniques 

In order to formulate the multigrid algorithm, we need a regular sequence of triangulations $\{\mathcal{T}_k\}_{k \ge 0}$. In our refinement process, two objectives are pursued. First, in order to improve the approximation, we should refine the grid locally where the solution behaves badly. Second, we have to construct weakly acute triangulations to guarantee that the corresponding discretization matrix is an M-matrix. Therefore we define the strategy and rules below. Given a triangulation, we refine its triangles as follows:

(1) The refinement process is started by a suitable error estimator, e.g. based on residuals, 
which marks some of the triangles as red. 


(2) If a triangle is marked

(i) red, it will be cut into four new ones by joining the midpoints of its edges,

(ii) green, it will be cut into two new triangles by joining the midpoint of the longest edge to the vertex opposite to this edge, and

(iii) blue, it will be cut into three new triangles by joining the midpoint of its longest edge to the vertex opposite to this edge and to the midpoint of one of the remaining edges (see Fig. 1).



Figure 1. Red, 



(3) Hanging nodes are avoided using the following rules: 

(i) a triangle with three hanging nodes is marked red 

(ii) a triangle with two hanging nodes is marked blue if one of the nodes lies on the longest edge of the triangle; otherwise it is marked red

(iii) a triangle with one hanging node is marked green if the node lies on the longest edge of the triangle; otherwise it is marked blue

Note that rules (ii) and (iii) may introduce new hanging nodes. However, one can prove that the refinement process obeying the above rules terminates after finitely many steps. Moreover, if $\mathcal{T}_0$ consists only of isosceles right-angled triangles, all triangulations $\mathcal{T}_k$ are guaranteed to be weakly acute.
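The closure rules (3) amount to a small decision table. The Python sketch below encodes them for one triangle; the local edge numbering (0, 1, 2), the single `longest_edge` index (ties are not treated) and the function names are illustrative assumptions. The example reproduces the effect noted above: a blue split forced by one hanging node also bisects the longest edge and so creates a new hanging node for the neighbouring triangle.

```python
def closure_color(hanging_edges, longest_edge):
    """Colour required by rule (3) for a triangle with hanging nodes on the given edges.

    hanging_edges: set of local edge indices (0, 1, 2) carrying a hanging node
    longest_edge: local index of the longest edge
    Returns None if the triangle has no hanging node.
    """
    n = len(hanging_edges)
    if n == 0:
        return None
    if n == 3:
        return "red"                                                # rule (i)
    if n == 2:
        return "blue" if longest_edge in hanging_edges else "red"   # rule (ii)
    return "green" if longest_edge in hanging_edges else "blue"     # rule (iii)


def bisected_edges(color, hanging_edges, longest_edge):
    """Local edges whose midpoints are used by the red/green/blue split of rule (2)."""
    if color == "red":
        return {0, 1, 2}
    if color == "green":
        return {longest_edge}
    if color == "blue":
        others = [e for e in hanging_edges if e != longest_edge]
        return {longest_edge, others[0]}
    return set()


# One hanging node, not on the longest edge: rule (iii) gives "blue", and the blue
# split also bisects the longest edge, creating a new hanging node next door.
hanging, longest = {1}, 0
color = closure_color(hanging, longest)
new_midpoints = bisected_edges(color, hanging, longest) - hanging
print(color, new_midpoints)   # blue {0}
```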




3.2. The Prolongation 


In order to solve problem (1), we have to find the solution $u_k$ of the discrete problem (7). Since the Crouzeix-Raviart element is nonconforming and $S_{k-1} \not\subset S_k$, we must construct a suitable transfer operator between $S_{k-1}$ and $S_k$. In addition, the discretization shows upwind effects due to the presence of strong convection in part of the domain. This must also be taken into account.

In [7, 8] a hierarchical basis multigrid method was used to solve a linear system arising from the convection-diffusion equation discretized by an upwind scheme. It was shown that the convergence of the hierarchical basis multigrid method depends on the strength of the convection term. When solving the discrete problem (7) with the multigrid algorithm [9], a similar effect can be seen in the numerical experiments. On the other hand, considering the one-dimensional problem, one sees that a good interpolation has to take the upwind effect into account. Therefore we introduce the following weighted $L^2$-projection. Define

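$$(u, v)_k := \sum_{T \in \mathcal{T}_k} \left(P^0_k(e^{\psi})|_T\right)^{-1} \sum_{\substack{T' \in \mathcal{T}_{k+1}\\ T' \subset T}} \int_{T'} E_k(u)\,v\,dx \qquad \forall u \in S_k,\ v \in S_k \cup S_{k+1}. \qquad (8)$$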

For all $u \in P_2(T)$, $T \in \mathcal{T}_k$, the quadrature rule

$$\int_T u\,dx = \frac{|T|}{3} \sum_{e \subset \partial T} u(m_e)$$

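is exact, so that (8) can be written as

$$(u, v)_k = \sum_{T \in \mathcal{T}_k} \left(P^0_k(e^{\psi})|_T\right)^{-1} \sum_{\substack{T' \in \mathcal{T}_{k+1}\\ T' \subset T}} \frac{|T'|}{3} \sum_{e \subset \partial T'} E_k(u)(m_e)\,v(m_e) \qquad (9)$$

for all $u \in S_k$ and $v \in S_k \cup S_{k+1}$.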
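The exactness of the edge-midpoint rule for quadratics can be verified with exact rational arithmetic. The sketch below checks all monomials of degree at most two on the reference triangle, which suffices because an affine map onto any triangle $T$ preserves polynomial degree and edge midpoints and scales both sides of the rule by the same factor; a cubic is included to show where exactness ends. The helper names are illustrative.

```python
from fractions import Fraction as Fr
from math import factorial

# Reference triangle with vertices (0,0), (1,0), (0,1); |T| = 1/2.
AREA = Fr(1, 2)
MIDPOINTS = [(Fr(1, 2), Fr(0)), (Fr(1, 2), Fr(1, 2)), (Fr(0), Fr(1, 2))]


def edge_midpoint_rule(u):
    """|T|/3 times the sum of u over the three edge midpoints."""
    return AREA / 3 * sum(u(x, y) for x, y in MIDPOINTS)


def exact_monomial(a, b):
    """Exact integral of x^a * y^b over the reference triangle: a! b! / (a + b + 2)!."""
    return Fr(factorial(a) * factorial(b), factorial(a + b + 2))


PAIRS = [(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2), (3, 0)]
results = {(a, b): edge_midpoint_rule(lambda x, y: x ** a * y ** b) == exact_monomial(a, b)
           for a, b in PAIRS}
print(results)   # True for every degree <= 2, False for (3, 0)
```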

Remark 3.1.

Note that if $v \in S_k$ holds, (9) reduces to

$$(u, v)_k = \sum_{T \in \mathcal{T}_k} \left(P^0_k(e^{\psi})|_T\right)^{-1} \frac{|T|}{3} \sum_{e \subset \partial T} \Pi^0_k(e^{\psi})|_e\,u(m_e)\,v(m_e).$$

Moreover, if $\psi$ is constant, we have

$$(u, v)_k = (u, v) \qquad \forall u \in S_k,\ v \in S_k \cup S_{k+1},$$

where $(u, v) := \int_\Omega uv\,dx$ denotes the usual $L^2$-inner product.

□
From (9) it follows that the standard basis functions of $S_k$ are mutually orthogonal with respect to the inner product $(\cdot,\cdot)_k$. Therefore we can obtain an easily computable prolongation operator $P^k_{k-1} : S_{k-1} \to S_k$ by

$$(P^k_{k-1}u_{k-1}, v_k)_k = (u_{k-1}, v_k)_{k-1} \qquad \forall u_{k-1} \in S_{k-1},\ v_k \in S_k.$$

It is straightforward to prove the following lemma:


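Lemma 3.2.

Let $u_{k-1} \in S_{k-1}$. Then:

If $e \in \mathcal{E}_k^{\partial}$, then

$$(P^k_{k-1}u_{k-1})(m_e) = \left(\Pi^0_k(e^{\psi})|_e\right)^{-1} (E_{k-1}u_{k-1})(m_e)|_{\hat T}\, \frac{P^0_k(e^{\psi})|_T}{P^0_{k-1}(e^{\psi})|_{\hat T}},$$

where $T$ (resp. $\hat T$) is the triangle in $\mathcal{T}_k$ (resp. $\mathcal{T}_{k-1}$) with $e \subset \partial T$ (resp. $e \subset \partial\hat T$).

If $e \in \mathcal{E}_k^{0}$, then

$$(P^k_{k-1}u_{k-1})(m_e) = \left(\Pi^0_k(e^{\psi})|_e\right)^{-1} \left(\frac{|T_L|}{P^0_k(e^{\psi})|_{T_L}} + \frac{|T_R|}{P^0_k(e^{\psi})|_{T_R}}\right)^{-1} \left(|T_L|\,\frac{(E_{k-1}u_{k-1})(m_e)|_{\hat T_L}}{P^0_{k-1}(e^{\psi})|_{\hat T_L}} + |T_R|\,\frac{(E_{k-1}u_{k-1})(m_e)|_{\hat T_R}}{P^0_{k-1}(e^{\psi})|_{\hat T_R}}\right),$$

where $T_L, T_R$ (resp. $\hat T_L, \hat T_R$) are the two triangles in $\mathcal{T}_k$ (resp. $\mathcal{T}_{k-1}$) with $T_L \cap T_R = e$ and $T_L \subset \hat T_L$, $T_R \subset \hat T_R$ (see Fig. 2).

□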


Remark 3.3.

If $\psi$ in Lemma 3.2 is constant, we obtain the usual $L^2$-projection as given in [9]. The coefficients $P^0_k(e^{\psi})|_T$, $k \ge 0$, are also computed during the construction of the stiffness matrix, hence the interpolation is not very expensive. On the other hand, as shown in [5], the coefficients $P^0_k(e^{\psi})|_T$, $k \ge 0$, introduce an upwind effect; i.e. the coefficient corresponding to the downwind node is equal to zero.
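Because the basis functions of $S_k$ are orthogonal in $(\cdot,\cdot)_k$, testing the defining relation of $P^k_{k-1}$ with a single basis function $\varphi_e$ gives each nodal value in closed form, which is what Lemma 3.2 states. The following Python sketch evaluates the two cases as stated there. It assumes that the required averages of $e^{\psi}$ and the one-sided values $(E_{k-1}u_{k-1})(m_e)|_{\hat T}$ are already available as numbers; the function and argument names are hypothetical. The usage lines check Remark 3.3 in the constant-$\psi$ case (an area-weighted average) and show how the side with much larger averages of $e^{\psi}$ receives almost no weight.

```python
def prolong_boundary_edge(eu_coarse, pi0_edge, p0_fine, p0_coarse):
    """(P u)(m_e) for a boundary edge e (first case of Lemma 3.2).

    eu_coarse: (E_{k-1} u_{k-1})(m_e) evaluated on the coarse triangle T^
    pi0_edge: Pi0_k(e^psi)|_e
    p0_fine: P0_k(e^psi)|_T; p0_coarse: P0_{k-1}(e^psi)|_T^
    """
    return eu_coarse * p0_fine / (pi0_edge * p0_coarse)


def prolong_interior_edge(pi0_edge, left, right):
    """(P u)(m_e) for an interior edge e (second case of Lemma 3.2).

    left, right: tuples (|T|, P0_k(e^psi)|_T, P0_{k-1}(e^psi)|_T^, (E_{k-1}u)(m_e)|_T^)
    for the fine triangles T_L, T_R sharing e and their coarse parents.
    """
    num = sum(area * eu / p0c for area, p0f, p0c, eu in (left, right))
    den = sum(area / p0f for area, p0f, p0c, eu in (left, right))
    return num / (den * pi0_edge)


# Constant psi: every average of e^psi equals c and E_{k-1} u = c * u, so the formulas
# reduce to the coarse value itself (boundary) or an area-weighted average (interior).
c = 7.0
print(prolong_boundary_edge(c * 5.0, c, c, c))                               # 5.0
print(prolong_interior_edge(c, (1.0, c, c, c * 2.0), (3.0, c, c, c * 4.0)))  # about 3.5

# Strongly varying psi: the right-hand side, with huge averages of e^psi, is nearly ignored.
skew = prolong_interior_edge(1.0, (1.0, 1.0, 1.0, 2.0), (1.0, 1e6, 1e6, 4.0))
print(skew)   # close to 2.0, the value from the left-hand side
```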
Figure 2. Interpolation.


3.3. The Smoother 


A suitable smoother for the system (7) is given in [10] by a Gauss-Seidel iteration with decoupling. This smoother is confined to special triangulations and does not allow adaptive grid refinement. Another candidate for problems with strong convection terms is the ILU iteration. Here we restrict ourselves to a variant of the ILU iteration. The ILU decomposition of the matrix $A_k$ of the linear system related to problem (7) and the standard basis of the Crouzeix-Raviart finite element space $S_{k,\xi}$ can be written as

$$A_k = L_k U_k - D_k,$$

where $L_k$, $U_k$ and $D_k$ are determined by the sparsity pattern of $A_k$. Denote by $a_k = (a_e)_{e \in \mathcal{E}_k}$ the coefficient vector of $u_k = \sum_{e \in \mathcal{E}_k} a_e \varphi_e \in S_{k,\xi}$ and by $b_k$ the right-hand side. Then the ILU iteration is given by:

$a^0_k$ an arbitrary starting vector, $\omega \in (0, 1]$,

$$a^i_k = a^{i-1}_k + \omega\,(L_k U_k)^{-1}(b_k - A_k a^{i-1}_k), \qquad i = 1, 2, \ldots$$


In order to get a good smoothing rate, we must optimize the factor

$$\frac{\|A_k(a^i_k - a^*_k)\|_2}{\|A_k(a^{i-1}_k - a^*_k)\|_2},$$

as mentioned in [11]. Here $a^*_k$ is the solution of $A_k a_k = b_k$. Therefore, by computing the optimal damping parameter $\omega$ in every step, our final smoothing algorithm is
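Algorithm 3.4.

$a^0_k$ an arbitrary starting vector, $r^0_k = b_k - A_k a^0_k$;
for $i = 1, 2, \ldots$ compute:

$s^{i-1}_k = (L_k U_k)^{-1} r^{i-1}_k$,

$t^{i-1}_k = A_k s^{i-1}_k$,

$\omega^{i-1} = (r^{i-1}_k, t^{i-1}_k) / (t^{i-1}_k, t^{i-1}_k)$,

$a^i_k = a^{i-1}_k + \omega^{i-1} s^{i-1}_k$,

$r^i_k = r^{i-1}_k - \omega^{i-1} t^{i-1}_k$.

end.

□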


Remark 3.5.

Algorithm 3.4 can be interpreted as a minimal residual method with ILU preconditioning.

□
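A minimal pure-Python sketch of Algorithm 3.4 follows, assuming dense list-of-lists matrices and the no-fill variant ILU(0). The text only speaks of "a variant of the ILU-iteration", so ILU(0) is an assumption, as is the test matrix, a hypothetical upwind convection-diffusion stencil on a small grid rather than the Crouzeix-Raviart matrix $A_k$. The step length $\omega$ is chosen in each step to minimize $\|r - \omega t\|_2$, so the recorded defect norms can never increase.

```python
def matvec(A, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]


def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))


def ilu0(A):
    """ILU(0): unit lower L and upper U stored in one array, fill restricted to A's nonzeros."""
    n = len(A)
    LU = [row[:] for row in A]
    for i in range(1, n):
        for k in range(i):
            if A[i][k] == 0:
                continue
            LU[i][k] /= LU[k][k]
            for j in range(k + 1, n):
                if A[i][j] != 0:
                    LU[i][j] -= LU[i][k] * LU[k][j]
    return LU


def ilu_solve(LU, r):
    """Solve (L U) z = r by forward and backward substitution."""
    n = len(LU)
    y = [0.0] * n
    for i in range(n):
        y[i] = r[i] - sum(LU[i][j] * y[j] for j in range(i))
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (y[i] - sum(LU[i][j] * z[j] for j in range(i + 1, n))) / LU[i][i]
    return z


def mr_ilu_smoother(A, b, a0, steps):
    """Minimal residual iteration with ILU(0) preconditioning (cf. Algorithm 3.4)."""
    LU = ilu0(A)
    a = list(a0)
    r = [bi - Ai for bi, Ai in zip(b, matvec(A, a))]
    history = [dot(r, r) ** 0.5]
    for _ in range(steps):
        s = ilu_solve(LU, r)              # s = (L U)^{-1} r
        t = matvec(A, s)                  # t = A s
        tt = dot(t, t)
        if tt == 0.0:
            break
        omega = dot(r, t) / tt            # minimises ||r - omega * t||_2
        a = [ai + omega * si for ai, si in zip(a, s)]
        r = [ri - omega * ti for ri, ti in zip(r, t)]
        history.append(dot(r, r) ** 0.5)
    return a, history


def convection_diffusion(m, gamma):
    """Hypothetical upwind 5-point matrix on an m x m grid (an M-matrix)."""
    n = m * m
    A = [[0.0] * n for _ in range(n)]
    for i in range(m):
        for j in range(m):
            p = i * m + j
            A[p][p] = 4.0 + gamma
            if j > 0:
                A[p][p - 1] = -1.0 - gamma    # upwind (west) neighbour
            if j < m - 1:
                A[p][p + 1] = -1.0
            if i > 0:
                A[p][p - m] = -1.0
            if i < m - 1:
                A[p][p + m] = -1.0
    return A


A = convection_diffusion(4, gamma=10.0)
a, hist = mr_ilu_smoother(A, [1.0] * 16, [0.0] * 16, steps=5)
print([round(h, 6) for h in hist])   # non-increasing defect norms
```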

3.4. Multigrid Algorithm 

Now we are in a position to formulate our multigrid algorithm.

Algorithm 3.6. (One MG iteration at level $k$)

(1) Pre-smoothing: Given $u^0_k = \sum_{e \in \mathcal{E}_k} a^0_e \varphi_e \in S_{k,\xi}$. For $i = 1, \ldots, \nu_1$ compute $u^i_k$ using Algorithm 3.4.

(2) Coarse-grid correction: Denote by $u^*_{k-1} \in S_{k-1,0}$ the solution of the coarse-grid problem

$$a_{k-1}(u_{k-1}, v_{k-1}) = (G_k(f), P^k_{k-1}v_{k-1}) - a_k(u^{\nu_1}_k, P^k_{k-1}v_{k-1}) \qquad \forall v_{k-1} \in S_{k-1,0}. \qquad (*)$$

If $k = 1$, set $\bar u_{k-1} = u^*_{k-1}$. If $k > 1$, compute an approximation $\bar u_{k-1}$ to $u^*_{k-1}$ by applying $\mu = 1$ or $\mu = 2$ iterations of the algorithm at level $k-1$ to problem (*) with starting value 0. Set

$$u^{\nu_1+1}_k = u^{\nu_1}_k + P^k_{k-1}\bar u_{k-1}.$$

(3) Post-smoothing: Apply $\nu_2$ iterations of Algorithm 3.4 to $u^{\nu_1+1}_k$.

□
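The control flow of Algorithm 3.6 can be sketched independently of the finite element details. The Python example below is a stand-in, not the paper's solver: it swaps in the 1D model problem $-u'' = f$ with zero boundary values, damped Jacobi smoothing instead of Algorithm 3.4, full weighting and linear interpolation instead of the weighted projections of Lemma 3.2, and a rediscretized coarse operator. Only the structure mirrors the algorithm: $\nu_1$ pre-smoothing steps, $\mu$ recursive coarse iterations started from 0 (with $\mu = 0$ meaning pure smoothing, $\mu = 1$ a V-cycle and $\mu = 2$ a W-cycle), prolongation of the correction, and $\nu_2$ post-smoothing steps.

```python
import math


def residual(u, f, h):
    """r = f - A u for A u = (-u[i-1] + 2 u[i] - u[i+1]) / h^2 with zero boundary values."""
    n = len(u)
    r = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        r.append(f[i] - (2.0 * u[i] - left - right) / (h * h))
    return r


def smooth(u, f, h, sweeps, w=2.0 / 3.0):
    """Damped Jacobi sweeps (stand-in for Algorithm 3.4)."""
    for _ in range(sweeps):
        r = residual(u, f, h)
        u = [ui + w * (h * h / 2.0) * ri for ui, ri in zip(u, r)]
    return u


def restrict(r):
    """Full weighting from 2m+1 fine points to m coarse points."""
    return [(r[2 * i] + 2.0 * r[2 * i + 1] + r[2 * i + 2]) / 4.0
            for i in range((len(r) - 1) // 2)]


def prolong(e, n_fine):
    """Linear interpolation from m coarse points to 2m+1 fine points."""
    out = []
    for j in range(n_fine):
        if j % 2 == 1:
            out.append(e[(j - 1) // 2])
        else:
            i = j // 2
            left = e[i - 1] if i > 0 else 0.0
            right = e[i] if i < len(e) else 0.0
            out.append((left + right) / 2.0)
    return out


def mg_cycle(u, f, h, mu, nu1=2, nu2=2):
    """One multigrid iteration with mu recursive coarse-grid iterations."""
    if len(u) == 1:
        return [f[0] * h * h / 2.0]                   # coarsest level: solve exactly
    u = smooth(u, f, h, nu1)                          # (1) pre-smoothing
    f_coarse = restrict(residual(u, f, h))            # coarse right-hand side
    e = [0.0] * len(f_coarse)                         # starting value 0
    for _ in range(mu):                               # (2) mu coarse iterations
        e = mg_cycle(e, f_coarse, 2.0 * h, mu, nu1, nu2)
    u = [ui + ei for ui, ei in zip(u, prolong(e, len(u)))]
    return smooth(u, f, h, nu2)                       # (3) post-smoothing


def norm(v):
    return math.sqrt(sum(x * x for x in v))


n, h = 63, 1.0 / 64.0
f = [1.0] * n
ratios = {}
for mu in (0, 1, 2):
    u = [0.0] * n
    r0 = norm(residual(u, f, h))
    for _ in range(8):
        u = mg_cycle(u, f, h, mu)
    ratios[mu] = norm(residual(u, f, h)) / r0
print(ratios)   # smoothing alone stalls; V- and W-cycles reduce the defect rapidly
```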
Remark 3.7.

So far, there exists no convergence proof for Algorithm 3.6. The standard convergence analysis, as in [9, 11], cannot be used here, because the bilinear form $a_k(\cdot,\cdot)$ is unsymmetric.

□

4. NUMERICAL RESULTS 


In this section we present three numerical examples which demonstrate the behaviour of the proposed multigrid method. In all experiments we measure the performance of a method by the arithmetic mean of the convergence rates

$$\rho_i = \frac{\|r^i\|}{\|r^{i-1}\|},$$

where $r^i$ is the defect of the $i$-th iteration.
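The performance measure is easy to reproduce from a recorded defect history. In this minimal Python sketch the defect norms are hypothetical inputs; the text does not specify which norm is used.

```python
def mean_convergence_rate(defect_norms):
    """Arithmetic mean of rho_i = ||r^i|| / ||r^{i-1}|| over a recorded defect history."""
    rates = [defect_norms[i] / defect_norms[i - 1] for i in range(1, len(defect_norms))]
    return sum(rates) / len(rates)


history = [1.0, 0.5, 0.125]             # hypothetical ||r^0||, ||r^1||, ||r^2||
print(mean_convergence_rate(history))   # (0.5 + 0.25) / 2 = 0.375
```

A common alternative is the geometric mean $(\|r^n\| / \|r^0\|)^{1/n}$; for this history it gives $\sqrt{0.125} \approx 0.354$, which by the AM-GM inequality can never exceed the arithmetic mean.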

The first model problem is taken from the papers of Brezzi, Marini and Pietra [3, 5]. We consider the domain $\Omega := (0,1) \times (0,1)$ with Neumann boundary

$$\Gamma_1 := \{(x, y) : ((x = 1) \wedge (y < 0.75)) \vee ((y = 1) \wedge (x < 0.75))\}$$

and Dirichlet boundary $\Gamma_0 := \partial\Omega \setminus \Gamma_1$, right-hand side $f = 0$ and potential $\psi$ defined as $\psi(x,y) :=$

fpo{x,y) 


l 


with 


$$\psi_0(x, y) := \begin{cases} 0.0 & \text{if } 0.0 \le r \le 0.8,\\ r - 0.8 & \text{if } 0.8 < r \le 0.9,\\ 0.1 & \text{if } 0.9 < r, \end{cases} \qquad r := \sqrt{(x-1)^2 + (y-1)^2}.$$

On $\Gamma_0$ we have $g(x, y) = 0$ if $x = 0$ or $y = 0$, and $g(x, y) = 1$ otherwise. We use the initial triangulation $\mathcal{T}_0$ as given in Fig. 3 and refine every triangulation by marking all triangles as red (uniform refinement). The numerical solution for $l = 10^6$ on a locally refined grid is shown in Fig. 4.



Figure 3. Initial triangulation 1. 


Figure 4. Numerical Solution. 
We test our multigrid Algorithm 3.6 with two pre- and two post-smoothing steps ($\nu_1 = \nu_2 = 2$) and with different values of $\mu$ (only smoothing: $\mu = 0$; V-cycle: $\mu = 1$; W-cycle: $\mu = 2$) for problems with varying $k_{max}$ ($k_{max} = 1, \ldots, 5$). The corresponding convergence rates for $l = 10$ and $l = 10^6$ are given in Tab. 1 and Tab. 2, respectively. In all experiments we used the same arbitrary starting vector.


k_max      1      2      3      4      5
μ = 0    .672   .854   .886   .904   .910
μ = 1    .032   .103   .159   .208   .253
μ = 2    .032   .074   .059   .059   .055

Table 1. Convergence rates (l = 10)


k_max      1      2      3      4      5
μ = 0    .736   .879   .890   .906   .910
μ = 1    .096   .245   .358   .427   .482
μ = 2    .096   .221   .266   .235   .201

Table 2. Convergence rates (l = 10^6)


In Tab. 3 we show the results for $k_{max} = 5$ with varying $\mu$ and $l = 10^m$.

m          0      1      2      3      4      5      6
μ = 0    .907   .910   .908   .906   .908   .910   .908
μ = 1    .227   .253   .444   .473   .483   .482   .482
μ = 2    .053   .055   .075   .174   .198   .201   .201

Table 3. Convergence rates (k_max = 5)

In the second experiment we take

$$f(x, y) = \frac{10^5\, e^{-\psi(x,y)}\,(2 - x - y)}{r\,(\cosh(100\,(r - 0.65)))^2},$$

with $\psi(x, y) = 10^3\,(1 + \tanh(100\,(r - 0.65)))$ and $r = \sqrt{(x-1)^2 + (y-1)^2}$. Again we choose $\Omega = (0,1) \times (0,1)$. The Dirichlet boundary is $\Gamma_0 = \partial\Omega$, and $g(x, y) = (x + y)\,e^{-\psi(x,y)}$. The exact solution is given by

$$u(x, y) = (x + y)\,e^{-\psi(x,y)}.$$
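The data of the second experiment are tied together by (1): substituting $u = (x+y)e^{-\psi}$ gives $\operatorname{grad}u + u\operatorname{grad}\psi = e^{-\psi}(1,1)$ and hence $f = 10^5 e^{-\psi}(2-x-y)/(r\cosh^2(100(r-0.65)))$. The Python sketch below (an illustration, not part of the original experiments) compares this closed form with a nested central-difference evaluation of $\operatorname{div}(\operatorname{grad}u + u\operatorname{grad}\psi)$ at two points with $r = 0.6$. The step sizes are ad hoc choices; closer to the layer at $r = 0.65$ the steep gradient of $\psi$ would require much smaller steps.

```python
import math


def psi(x, y):
    r = math.hypot(x - 1.0, y - 1.0)
    return 1e3 * (1.0 + math.tanh(100.0 * (r - 0.65)))


def u_exact(x, y):
    return (x + y) * math.exp(-psi(x, y))


def f_formula(x, y):
    """Right-hand side obtained by substituting u_exact into (1)."""
    r = math.hypot(x - 1.0, y - 1.0)
    return 1e5 * math.exp(-psi(x, y)) * (2.0 - x - y) / (r * math.cosh(100.0 * (r - 0.65)) ** 2)


def f_numeric(x, y, h1=1e-6, h2=1e-4):
    """div(grad u + u grad psi) by nested central differences."""
    def current(x, y):
        ux = (u_exact(x + h1, y) - u_exact(x - h1, y)) / (2 * h1)
        uy = (u_exact(x, y + h1) - u_exact(x, y - h1)) / (2 * h1)
        px = (psi(x + h1, y) - psi(x - h1, y)) / (2 * h1)
        py = (psi(x, y + h1) - psi(x, y - h1)) / (2 * h1)
        u = u_exact(x, y)
        return ux + u * px, uy + u * py
    djx = (current(x + h2, y)[0] - current(x - h2, y)[0]) / (2 * h2)
    djy = (current(x, y + h2)[1] - current(x, y - h2)[1]) / (2 * h2)
    return djx + djy


s = 0.6 / math.sqrt(2.0)
points = [(1.0 - s, 1.0 - s), (0.4, 1.0)]   # both points have r = 0.6
for x, y in points:
    print(f_formula(x, y), f_numeric(x, y))
```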

The numerical solution is shown in Fig. 7. We used three different coarse grids, as given in Fig. 3, Fig. 5 and Fig. 6, to show that Algorithm 3.6 does not depend significantly on the orientation of the grid. For uniform refinement and $k_{max} = 5$, Tab. 4 shows the results with varying $\mu$ ($\mu = 0, 1, 2$). In Tab. 4 we also show the results for $k_{max} = 6$ and adaptive refinement of the grid (see Fig. 8).
Figure 5. Initial triangulation 2.


Figure 6. Initial triangulation 3. 



Figure 7. Numerical solution. Figure 8. Adaptively refined grid (k = 4).


grid     init. triang. 1   init. triang. 2   init. triang. 3   loc. ref.
μ = 0         .900              .903              .891           .905
μ = 1         .311              .225              .216           .409
μ = 2         .157              .087              .128           .355

Table 4. Convergence rates

Finally we consider an experiment with a real-world problem. Fig. 9 shows the schematic structure of the doping of a thyristor. With an existing simulation program (ABBPISCES) we computed the solution $u$ of (1) and the potential $\psi$ of the coupled stationary semiconductor equations for a blocking state (see Fig. 11 resp. Fig. 12) and an on-state of the thyristor (see Fig. 13 resp. Fig. 14). The computed potential $\psi$ was substituted into equation (1), and the resulting system was solved with our multigrid algorithm. Fig. 10 shows the grid for an adaptive refinement ($k = 5$). Finally, Tab. 5 shows the convergence rates for Algorithm 3.6 with a suitable number of pre- and post-smoothing steps, with varying $\mu$ ($\mu = 0, 1, 2$) and $k_{max} = 7$.




Figure 9. Schematic structure of the doping. 



Figure 10. Adaptive grid (k = 5).





Figure 13. Solution (log). Figure 14. Potential. 


state                        blocking    on
μ = 0                          .843     .828
μ = 1 (ν1 = ν2 = 22)           .249     .112
μ = 2 (ν1 = ν2 = 9)            .108     .121

Table 5. Convergence rates
Abstract

Grid staggering is a well-known remedy for the problem of velocity/pressure coupling in incompressible flow calculations. Numerous inconveniences occur, however, when staggered grids are implemented, particularly when a general-purpose code, capable of handling irregular three-dimensional domains, is sought. In several non-staggered grid numerical procedures proposed in the literature, the velocity/pressure coupling is achieved by either pressure or velocity (momentum) averaging. This approach is not convenient for simultaneous (block) solvers that are preferred when using multigrid methods. A new method is introduced in this paper that is based upon a non-staggered grid formulation with a set of virtual cell face velocities used for pressure/velocity coupling. Instead of pressure or velocity averaging, a momentum balance at the cell face is used as a link between the momentum and mass balance constraints. The numerical stencil is limited to 9 nodes (in 2D) or 27 nodes (in 3D), both during the smoothing and the inter-grid transfer, which is a convenient feature when a block point solver is applied. The results for a lid-driven cavity and a cube in a lid-driven cavity are presented and compared to staggered grid calculations using the same multigrid algorithm. The method is shown to be stable and to produce a smooth (wiggle-free) pressure field.

1 Ph.D. Student 
1 Introduction


Multigrid methods are used in a number of applications in fluid dynamics, usually by applying the Full Approximation Scheme [1]. Incompressible flow calculations usually employ a staggered grid because of its strong coupling between the pressure and the velocity field (e.g. [2]). For complex geometries, however, as well as for calculations in non-orthogonal coordinates, the use of a staggered grid is a serious obstacle to efficient and well-structured computer coding [3]. Additional complexities arise when a block solver is used; for example, variables cannot be easily grouped into cell-bound blocks due to the different node counts. Some authors resort to asymmetric nodal clusters [5] while others update a symmetric block of variables around the cell-centre node, thereby updating face velocities twice in each relaxation sweep [5, 6]. Various levels of decoupled relaxation are also common. These include distributive relaxation, where all momentum equations are solved together and the pressure field is solved separately [1, 7], and sequential schemes that update variables throughout the flow field one by one [8, 9]. Some comparative studies of block versus sequential relaxation give no clear preference [10, 11]. There is a greater consensus that grid staggering is a necessary burden, particularly in the context of multigrid methods ([12, 13, 14, 15, 16] and even [1]). Comparison studies of staggered and non-staggered methods are sometimes conflicting in their assessment of the accuracy and stability of any given method. While some authors demonstrate that non-staggered methods match the staggered ones by both criteria ([13, 16, 17]), others question it ([18]). Despite this, the majority of finite volume incompressible calculations use staggered grids. The main reason may be that existing non-staggered methods increase rather than lessen the complexity relative to staggered grid calculations. For example, the method of Rhie and Chow [19] (adopted by [13, 14, 15]) requires that both the nodal and the cell face velocities are stored. Moreover, in a multigrid context, both the nodal and the face velocities need to be restricted [8], requiring even more computational work. Also, the computational cluster extends beyond the 9- or 27-point stencil in two- or three-dimensional formulations, respectively, for the first-order discretisation, and even further if higher-order methods are used.

The considerations mentioned above motivated the development of a method, 
suitable for block solvers on irregular three-dimensional domains, that uses 
a non-staggered grid. 


In this paper, the new non-staggered method is described briefly in section 2 
and its multigrid implementation in section 3; the test cases and results are 
presented in section 4, followed by a discussion section in which the relative 
merits of the method are assessed. 


2 The new non-staggered method 


A transport equation for a general set of transported variables u in the 
volume Ω bounded by a boundary S can be expressed in an integral form 
suitable for a finite volume formulation: 

$$\frac{d}{dt}\int_{\Omega}\rho\,\mathbf{u}\,d\Omega+\oint_{S}\left(\rho\,u_i\,\mathbf{u}-\Gamma\,\frac{\partial\mathbf{u}}{\partial x_i}\right)n_i\,dS=F_{\mathbf{u}}, \qquad (1)$$

where ρ is the density, u_i is the velocity component in the x_i direction and 
n_i is the component of the unit normal to the boundary S (with summation over 
repeated indices). When (1) is applied to the momentum balance of a viscous 
incompressible fluid, the set u is the velocity vector u = {u_j, j = 1, …, d} 
(d being the problem dimension), with 
the corresponding diffusion coefficient Γ = μ (the dynamic viscosity), and the 
source terms 

$$F_{u_j}=-\oint_{S}p\,n_j\,dS \qquad (2)$$

(p being the pressure) in the absence of external volume forces. The extension 
to other transportable properties (such as enthalpy, mass fraction, etc.) is 
straightforward: the vector u is augmented to include the new variables, and 
appropriate source terms and diffusion coefficients are defined. In the 
following presentation a three-dimensional implementation will be used. 

The momentum equations are discretised using the hybrid (central/upwind) 
difference scheme [20], although higher-order schemes can be employed¹. The 
pressure field is resolved by means of mass conservation for the control 
volume around the node, in a symmetric block manner as used by Vanka [5] for 
the staggered grid, although the extension to the line block formulation is 
straightforward. The face velocities that are substituted into the mass 
conservation equation are estimated by discretising the momentum equation 
over a half-length volume around each cell face, directly involving the nodal 
pressure and velocities, while the lateral velocities are obtained by averaging 
values from the nearby nodes (see Fig. 1). More details on the coefficient 
generation are given in [22]. 

¹ For a multigrid implementation of a second-order upwind scheme on a 
staggered grid see e.g. [21]. 

The principle of discretising the face velocity using a half-size cell is also 
applied by Schneider and Raw [3], although in their method the coefficients of 
the face velocity are treated implicitly by incorporating them into the nodal 
velocity coefficient matrix. To ensure positive definiteness of the nodal 
velocity coefficient matrix, Schneider and Raw had to truncate the momentum 
equation applied to the face velocities [3, 16]. In the present method, the 
face velocities are expressed explicitly in terms of the surrounding nodal 
values and used in the continuity equation for the pressure correction 
calculation. The implication of this step is that the family of face velocities 
satisfies both momentum and mass conservation exactly at the positions where 
the convection velocities in a general transport equation are required. On the 
other hand, being expressed through the (tentative) nodal velocities, they are 
readily available without any need for permanent storage. 

The boundaries of the flow field coincide with the cell faces, enabling the 
definition of a set of boundary nodes there. This practice ensures consistency 
between the global mass balance of the whole calculation domain and the local 
mass balance of each cell, but calls for special treatment if Neumann boundary 
conditions are to be used. If the zero-gradient condition for the normal 
velocity is discretised in the usual way, 

$$\frac{\partial(u_i n_i)}{\partial(x_i n_i)}\approx\frac{(u_i n_i)_b-(u_i n_i)_{inn}}{(x_i n_i)_b-(x_i n_i)_{inn}}=0, \qquad (3)$$

where the subscripts b and inn denote the boundary node and the first inner 
node respectively, the flow rate through the boundary will be linked to the 
nodal velocity at the first inner node, which does not belong to the 
mass-conserving (face-velocity) field, resulting in poor overall mass 
conservation. The correct way to implement the Neumann boundary condition in 
this case is to use the face velocity instead. This way, the local and global 
mass balances become fully compatible. No special treatment is needed for 
Dirichlet boundary conditions, where the face velocities coincide with the 
boundary and are assumed known. 
3 The multigrid implementation 

In the multigrid context, the nonlinear equation (1) can be expressed as 

$$\mathcal{L}(u)=F \qquad (4)$$

by grouping all terms that will result in a coefficient matrix (within a 
Newton iteration cycle) into the operator 𝓛 and the remainder into the source 
term F, as in [13, 23]. The more common practice of writing Eqn. (4) in 
homogeneous form (by absorbing F into 𝓛(u)) [1, 24] is found by the present 
authors to be somewhat confusing, especially when defining the residual 
transfer to the coarse grid. 

The discretised (sparse, positive definite) Eqn. (4) for grid l is linearised 
by a Newton iteration [24], 

$$L^l u^l=F^l, \qquad (5)$$

and relaxed by a block Gauss-Seidel method. 

The updates of the variable set, 

$$u'=\alpha\,\big(\operatorname{diag}(L^l)\big)^{-1}R^l, \qquad (6)$$

are expressed in terms of the residual $R^l = F^l - L^l u^l$, the inverse of the 
coefficient matrix diagonal $(\operatorname{diag}(L^l))^{-1}$ and the underrelaxation 
coefficient α. Variables at the node i, j, k are then updated by 
$u_{i,j,k} \leftarrow u_{i,j,k} + u'_{i,j,k}$. 
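A minimal sketch of update (6), assuming point (scalar) blocks rather than the cell-bound velocity/pressure blocks of the paper: at each node the residual is formed with the latest values and the node is corrected by α times the residual divided by the diagonal entry. The function name, the value α = 0.8 and the 1-D test matrix are hypothetical illustrations only.

```python
import numpy as np

def underrelaxed_gs_sweep(L, F, u, alpha=0.8):
    """One underrelaxed point Gauss-Seidel sweep for L u = F.

    For each node i the residual R_i = F_i - (L u)_i uses the latest values,
    then u_i += alpha * R_i / L_ii, i.e. Eqn. (6) applied node by node.
    Point blocks only: an assumption, not the paper's cell-bound blocks.
    """
    u = u.astype(float).copy()
    for i in range(len(u)):
        R_i = F[i] - L[i, :] @ u
        u[i] += alpha * R_i / L[i, i]
    return u

# Hypothetical test problem: 1-D Poisson-like tridiagonal matrix.
n = 8
L = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
F = np.ones(n)
u = np.zeros(n)
for _ in range(500):
    u = underrelaxed_gs_sweep(L, F, u)
```

Because each node uses the freshest neighbour values, this is a Gauss-Seidel (not Jacobi) form of (6); with α < 1 it is the underrelaxed variant.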

Restriction is accomplished by grouping a cluster of eight cells into one. 
This leads to the following operator: 

$$\phi_{I,J,K}=\tfrac{1}{8}\big(\phi_{i,j,k}+\phi_{i+1,j,k}+\phi_{i,j+1,k}+\phi_{i,j,k+1}+\phi_{i+1,j+1,k}+\phi_{i,j+1,k+1}+\phi_{i+1,j,k+1}+\phi_{i+1,j+1,k+1}\big), \qquad (7)$$

where i = 2I − 2, . . . The same operator is applied to both variable and 
residual restriction. After both the variables and the residuals are transferred 
to the next coarser grid (l − 1), Eqn. (5) is approximated as 

$$L^{l-1}(u^{l-1})=\hat F^{l-1}, \qquad (8)$$

where 

$$\hat F^{l-1}=F^{l-1}-\big(F_0^{l-1}-L_0^{l-1}u_0^{l-1}\big)+\mathcal{R}_l^{l-1}R^l \qquad (9)$$

is the equivalent source term on the coarse grid; the subscript 0 denotes 
quantities evaluated with the variables $u_0^{l-1}$ restricted from grid l, and 
$\mathcal{R}_l^{l-1}$ is the restriction operator (7). The restriction at Neumann 
boundaries is carried out using a divided form of the boundary conditions [1]. 
For the velocity component perpendicular to the boundary, an additional 
correction is made to preserve the mass flow rate through the boundary. 
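The eight-cell averaging (7) and the coarse source term (9) can be sketched with NumPy as below. This assumes an interior-only array with even dimensions (the paper's boundary nodes and the index offset i = 2I − 2 are not modelled), and `L0_apply` is a hypothetical callable standing in for the coarse operator frozen at the restricted variables.

```python
import numpy as np

def restrict_8_cells(phi):
    """Eqn. (7): average each 2x2x2 cluster of fine cells into one coarse cell."""
    nx, ny, nz = phi.shape
    blocks = phi.reshape(nx // 2, 2, ny // 2, 2, nz // 2, 2)
    return blocks.mean(axis=(1, 3, 5))

def fas_coarse_source(F_c, F0_c, L0_apply, u0_c, R_f):
    """Eqn. (9): F_hat = F - (F_0 - L_0 u_0) + restrict(R^l) on grid l-1.

    L0_apply (hypothetical) applies the coarse operator evaluated at u0_c.
    """
    return F_c - (F0_c - L0_apply(u0_c)) + restrict_8_cells(R_f)

phi = np.arange(64, dtype=float).reshape(4, 4, 4)
coarse = restrict_8_cells(phi)   # shape (2, 2, 2)
```

Since (7) is a plain average, it preserves the mean of the restricted field; the same function serves for both variables and residuals, as the text states.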

Prolongation is carried out by trilinear interpolation using a seven-point 
stencil, shown here for one cell and with injection only: 

<f>ij,k = + <f>I,J,K-l ) (10) 

with I = (i + 2)/2, . . . The injection upon the first visit to the fine grid 
(FMG cycling is assumed) and the fine-grid correction are done as 

$$u^l_{\mathrm{first}}=\mathcal{P}_{l-1}^{l}\,u^{l-1}\quad\text{or}\quad u^l_{\mathrm{new}}=u^l_{\mathrm{old}}+\mathcal{P}_{l-1}^{l}\big(u^{l-1}-u_0^{l-1}\big). \qquad (11)$$
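The second form of (11) interpolates only the coarse-grid change, not the coarse solution itself. The sketch below illustrates that idea; to keep it short it uses piecewise-constant copying as a hypothetical stand-in for the trilinear prolongation (10), so it is not the paper's interpolation.

```python
import numpy as np

def prolong_constant(phi_c):
    """Hypothetical stand-in for Eqn. (10): copy each coarse value to its 2x2x2 fine cells."""
    return phi_c.repeat(2, axis=0).repeat(2, axis=1).repeat(2, axis=2)

def fas_correct(u_old, u_c, u0_c, prolong=prolong_constant):
    """Eqn. (11), second form: add the interpolated coarse-grid change u - u_0."""
    return u_old + prolong(u_c - u0_c)

u_old = np.zeros((4, 4, 4))
u0_c = np.ones((2, 2, 2))
u_c = 3.0 * np.ones((2, 2, 2))
u_new = fas_correct(u_old, u_c, u0_c)   # every fine cell receives the change 2.0
```

If the coarse grid leaves the restricted variables unchanged, the fine-grid solution is left untouched, which is the point of correcting with the difference.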

4 Test cases 

The non-staggered method presented in this paper is compared with 
the staggered three-dimensional calculations employing the block symmetric 
Gauss-Seidel algorithm of Vanka [5]. For both methods the coding and data 
structures are of the same style. 

The flow in a three-dimensional cavity with a moving top is used as the first 
test case. The residual norm history is shown in Fig. 2. The rate of 
convergence obtained on a staggered grid is comparable with the results of 
Vanka [10], where 12 work units (w.u.) were necessary for a residual reduction 
of two orders of magnitude. In our calculations, 14 w.u. were necessary for 
the staggered grid calculation and 18 w.u. for the non-staggered calculation². 
However, the change in slope of the non-staggered residual may indicate that 
the full potential of multigrid acceleration has yet to be achieved. 

In the second test case, a cube is inserted in a cavity (Fig. 3), forcing the 
flow to negotiate this asymmetric three-dimensional obstacle, partly by a 
change in velocity magnitude and partly by flow separation. This flow geometry 
is believed to be a good test of the pressure/velocity coupling, because the 
major force behind the flow adjustment is the pressure field. The residual 
history (Fig. 4) indicates very similar convergence rates for the staggered 
and non-staggered calculations. The resulting flow field in a symmetry plane 
(Fig. 5) shows well-resolved separation bubbles around the cube corners. 

² Or 23 w.u. for the same residual decrease; however, this figure is more 
arbitrary because of the much lower initial residual on the finest grid. 

5 Discussion and conclusions 

The new method of incompressible flow calculation using non-staggered 
grids and its multigrid implementation are examined for suitability in a 
complex flow field geometry. The presence of two sets of velocity values, 
both of which satisfy the (discrete form of the) governing equations, is 
expected to increase the overall level of accuracy for a given grid size, 
although this remains to be quantified. 

In the numerical experiments performed so far, the method proved to be 
stable, without any tendency to produce an oscillating pressure field, which 
is a common feature of some non-staggered methods [18]. The method permits 
discretization on a trivially coarse grid (with a single node in the 
interior), which is very convenient in a FAS multigrid implementation 
because it allows the coarsest grid to contain the smallest possible number 
of nodes. Thus significantly coarser grids can be used in complex geometries. 
For example, in the case of a cube in a cavity (see the previous section), the 
coarsest grid (6x6x6 nodes) has only one control volume between the cube and 
the cavity wall on one side. If the calculation method required at least two 
nodes there, the overall node count on the coarsest grid would increase eight 
times (a factor of two in each of the three directions), thereby substantially 
increasing the work needed to obtain the exact solution on the coarsest grid. 

The tests performed so far have always produced smooth solutions in both 
velocity and pressure, which suggests a high ellipticity measure of the 
proposed method. The analytical evaluation of the ellipticity measure 
remains to be carried out (following e.g. [25, 16]). 

The computational work of the proposed method is slightly larger than that 
of the Rhie and Chow [19] method. It is comparable to the work in the SCGS 
method of Vanka, requiring the same amount of work to calculate face 
velocities and pressure coefficients plus the calculation of the nodal 
velocity coefficients, i.e. an increase of approximately 25% in 
two-dimensional and 14% in three-dimensional calculations. This overhead 
exists only for the simplest flow problem, because for any additional 
variable that is solved, the nodal velocity coefficients can be reused (with 
proper scaling of the diffusion part). 
Figure 1: The layout of a non-staggered grid. Only the nodal variables 
require storage. 


[Figure 2: residual history for the 3D cavity flow; 5 grids: 2x2x2 ... 32x32x32] 
Figure 3: The grid for a flow around the cube in a lid-driven cavity. 


[Figure 4 plot: flow around the cube in a cavity; 4 grids: 6x6x6 ... 48x48x48; legend: staggered grid, non-staggered grid] 

Figure 4: Mass residual history for the flow around the cube in a lid-driven 
cavity. Cavity Re = 400. 
ABSTRACT 


We consider the Helmholtz equation with a discontinuous complex parameter and 
inhomogeneous Dirichlet boundary conditions in a rectangular domain. A variant of the direct 
method of cyclic reduction is employed to facilitate the design of improved multigrid components, 
resulting in the method of CR-MG. We demonstrate the improved convergence properties of this 
method. 


1 INTRODUCTION 


Microwave heating of foods has revolutionised the food processing industry. Effective and 
efficient microwave heating depends very much on a detailed knowledge and understanding of the 
dielectric properties of the food to be processed. This need has given rise to extensive research into 
the dielectric properties of materials; see, for example, Tinga and Nelson [1]. 

Microwave heating can be compared to heating by alternating current. The electric field of 
alternating current changes direction approximately 100 times each second, whereas the microwave 
field changes direction approximately 5 billion times each second. The heating effect is 
accomplished by energy transfer to dipoles, most commonly water. Most foods contain between 50 
and 90 percent water. By attempting to follow the very rapidly changing microwave electric field, 
the molecular vibrations of the dipoles are strengthened, thus increasing the temperature of the 
water and hence the food. 


The scalar potential φ associated with the microwave field satisfies the wave equation 

$$\nabla^2\phi-\epsilon\mu\,\frac{\partial^2\phi}{\partial t^2}=0, \qquad (1)$$

which is derived from Maxwell's equations of electromagnetics. The parameter ε describes the 
permittivity of the medium and the parameter μ the permeability. However, the radiation field in 
a microwave oven varies harmonically in time, and so we look for a solution of equation (1) in the 
form 

$$\phi(\mathbf{x},t)=e^{i\omega t}\,u(\mathbf{x}),$$

Figure 1: A two-dimensional model of a microwave oven. 

where u is a time-independent scalar potential function and ω is the angular frequency of the microwave 
radiation. By substituting this expression into equation (1), we see that u satisfies the Helmholtz 
equation 

$$\nabla^2 u+\delta u=0,$$

where δ := εμω². In general, ε and μ are complex numbers, with real parts related to a material's 
ability to store electrical and magnetic energy respectively, and imaginary parts related to a 
material's ability to dissipate electrical and magnetic energy respectively. However, the 
permeability of biological materials is close to that of free space, i.e. μ ≈ μ₀ = 4π × 10⁻⁷ H m⁻¹. 
Hence, since most domestic microwave ovens operate at a frequency of 2450 MHz, we can calculate 
δ for any given permittivity ε. 
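The calculation of δ described above can be sketched as follows, assuming μ = μ₀ and ε = ε_r ε₀ with ε₀ obtained from μ₀ε₀c² = 1. The function name and the complex-permittivity input are illustrative; the sign convention of the imaginary part is left to the caller and is not taken from the paper.

```python
import math

MU0 = 4e-7 * math.pi             # mu_0 = 4*pi*1e-7 H/m, the classical value quoted in the text
C0 = 299_792_458.0               # speed of light in vacuum, m/s
EPS0 = 1.0 / (MU0 * C0**2)       # eps_0 from mu_0 * eps_0 * c^2 = 1, in F/m

def helmholtz_delta(eps_r, freq_hz=2450e6):
    """delta = eps * mu * omega^2 with mu = mu_0 and eps = eps_r * eps_0.

    eps_r may be complex; its sign convention is an assumption left to the caller.
    """
    omega = 2.0 * math.pi * freq_hz      # angular frequency, rad/s
    return eps_r * EPS0 * MU0 * omega**2

delta_free = helmholtz_delta(1.0)        # free space: (omega / c)^2, in m^-2
```

In free space δ reduces to the square of the wavenumber ω/c, roughly 2.6 × 10³ m⁻² at 2450 MHz; a lossy material simply passes a complex ε_r.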

The oven is represented schematically (in two dimensions) by the rectangular domain depicted 
in Fig. 1. Region 1 corresponds to free space, so ε = ε₀ ≈ (1/36π) × 10⁻⁹ F m⁻¹ and δ is a real 
constant in this region. Region 2 corresponds to the heated material, so δ is a complex constant 
in this region. Energy is fed into the system by a magnetron via the waveguide. Hence, in this 
paper, we consider the solution of the Helmholtz equation with a discontinuous complex parameter 
and inhomogeneous Dirichlet boundary conditions in a rectangular domain. 

We close this section with a plan of the paper. In section 2 we describe the mathematical 
problem and discuss the smoothing abilities of two multigrid smoothers. In section 3 we describe 
the technique of approximate cyclic reduction and show how this can be used to design improved 
multigrid components. Numerical results are presented in section 4 and some concluding remarks 
are made in section 5. 


2 MATHEMATICAL PROBLEM 

Consider the complex two-dimensional Dirichlet boundary value problem 


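$$\nabla^2 u+\delta u=0\ \ \text{in}\ \Omega=\Omega_1\cup\Omega_2,\qquad u=g\ \ \text{on}\ \partial\Omega, \qquad (2a)$$

with data 

$$\delta=\begin{cases}\delta_1 & \text{in subdomain } \Omega_1,\\ \delta_2 & \text{in subdomain } \Omega_2,\end{cases} \qquad (2b)$$

where Ω₁ and Ω₂ are rectangular subdomains of Ω (as in Fig. 1). 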
Operator Definitions 


Before attempting to solve this problem by the multigrid method, we need to consider carefully 
the definitions of the discretisation, restriction and prolongation operators. In [2], De Zeeuw 
considers the solution of general linear second order elliptic partial differential equations over 
similar domains. He notes that the rate of convergence of standard multigrid methods often 
deteriorates when the coefficients in the differential equation are discontinuous, and proposes 
matrix-dependent grid transfer operators to overcome these difficulties. In our case, however, the 
discontinuity occurs only in the coefficient of u (viz. δ), and not in that of ∇u. Hence we proceed in 
the following way to define an operator R = R(δ) in the domain Ω, where R can be taken to represent 
the discretisation, restriction or prolongation operator. Firstly, if δ takes the value δ_i in subdomain 
Ω_i (i = 1, 2), then we set the value of δ on the interior boundary between Ω₁ and Ω₂ to 
δ₃ := ½(δ₁ + δ₂). Secondly, R is defined piecewise by 

$$R=\begin{cases}R(\delta_1) & \text{in } \Omega_1,\\ R(\delta_2) & \text{in } \Omega_2,\\ R(\delta_3) & \text{on } \partial\Omega_2.\end{cases} \qquad (3)$$

In practice, this definition of R for discontinuous δ does not seem to impair the convergence of 
the multigrid algorithm for relevant values of δ. 
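Definition (3) amounts to assigning each grid node one of three δ values. The sketch below builds such a nodal δ field on the unit square under a hypothetical layout Ω₂ = (lo, hi)², with nodes on ∂Ω₂ receiving the interface average δ₃; the bounds 0.25 and 0.75 are illustrative, not the paper's data.

```python
import numpy as np

def delta_field(n, delta1, delta2, lo=0.25, hi=0.75):
    """Nodal delta on an (n+1) x (n+1) grid of the unit square, following (3).

    Hypothetical layout: Omega_2 = (lo, hi)^2. Nodes inside the open square get
    delta2, nodes on its boundary get delta3 = (delta1 + delta2) / 2, and all
    remaining nodes get delta1.
    """
    x = np.linspace(0.0, 1.0, n + 1)
    X, Y = np.meshgrid(x, x, indexing="ij")
    tol = 1e-12
    closed = (X >= lo - tol) & (X <= hi + tol) & (Y >= lo - tol) & (Y <= hi + tol)
    open_ = (X > lo + tol) & (X < hi - tol) & (Y > lo + tol) & (Y < hi - tol)
    delta = np.full(X.shape, delta1, dtype=complex)
    delta[closed & ~open_] = 0.5 * (delta1 + delta2)   # interface nodes: delta3
    delta[open_] = delta2
    return delta

d = delta_field(8, 1.0, 30 + 10j)
```

Any operator R(δ) (discretisation, restriction or prolongation) can then be evaluated node by node from this field.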


Equivalent System of Real Equations 


Consider the discrete analogue of problem (2). Suppose δ = α + iβ ∈ ℂ and g ∈ ℝ. Using a 
central difference discretisation on a mesh of n × n intervals, the matrix of the discrete system 
Au = f is represented in stencil notation by 

$$A\sim\frac{1}{h^2}\begin{bmatrix} & 1 & \\ 1 & p & 1\\ & 1 & \end{bmatrix}, \qquad (4)$$

where A ∈ ℂ^{(n−1)²×(n−1)²}, h := 1/n and p := δh² − 4 = (αh² − 4) + iβh². Hence, while most linear 
systems that arise in practice have real coefficient matrices, the discretisation of this problem 
yields a complex linear system. Further applications that give rise to complex linear systems 
include discretisations of the time-dependent Schrödinger equation, inverse scattering problems and 
underwater acoustics. 
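To make stencil (4) concrete, the sketch below assembles the dense matrix for a constant δ on a tiny grid, assuming the Dirichlet values have been moved to the right-hand side. It shows that the matrix is complex symmetric (A = Aᵀ) but not Hermitian, which matters for the real-equivalent system discussed next. The function name and the sample δ are illustrative.

```python
import numpy as np

def helmholtz_matrix(n, delta):
    """Dense matrix of stencil (4) for constant delta on the (n-1)^2 interior nodes."""
    h = 1.0 / n
    m = n - 1
    T = -2.0 * np.eye(m) + np.eye(m, k=1) + np.eye(m, k=-1)   # 1-D stencil [1 -2 1]
    I = np.eye(m)
    lap = (np.kron(I, T) + np.kron(T, I)) / h**2              # (1/h^2)[1; 1 -4 1; 1]
    return lap + delta * np.eye(m * m)                        # centre: (delta h^2 - 4)/h^2 = p/h^2

n, delta = 8, 30 + 10j
A = helmholtz_matrix(n, delta)
p = delta / n**2 - 4                                          # p = delta h^2 - 4 with h = 1/n
```

A dense matrix is used only because the example is tiny; a practical solver would store the five-point stencil or a sparse matrix.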


A popular approach for solving complex linear systems is to solve the equivalent real linear 
system for the real and imaginary parts of u. However, the following remarks, due to Freund [3], 
cast doubt on this approach. Suppose that A is a general complex m × m matrix. By taking real 
and imaginary parts, we can rewrite the complex system as the real 2m × 2m linear system 

$$B\begin{pmatrix}\operatorname{Re}u\\ -\operatorname{Im}u\end{pmatrix}=\begin{pmatrix}\operatorname{Re}f\\ \operatorname{Im}f\end{pmatrix},\qquad B=\begin{pmatrix}\operatorname{Re}A & \operatorname{Im}A\\ \operatorname{Im}A & -\operatorname{Re}A\end{pmatrix}.$$

It can then be shown that B has eigenspectrum 

$$\sigma(B)=\{\lambda\in\mathbb{C}\;:\;\lambda^2\in\sigma(A\bar A)\},$$

which means that σ(B) is symmetric with respect to the real and imaginary axes, and hence the 
eigenvalues always surround the origin. Now if A is complex symmetric (as is the case with (4)), 
then B is a real symmetric matrix with real eigenvalues distributed symmetrically about the origin, 
i.e. B is symmetric indefinite. Therefore the equivalent real system is often harder to solve than the 
original complex one. 
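Freund's remark can be checked numerically on a small random complex symmetric matrix (a hypothetical test matrix, not the discrete Helmholtz operator). The sketch forms B, confirms that it reproduces the complex solution in the form (Re u, −Im u), and computes the spectra of B and of AĀ for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 6
C = rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m))
A = C + C.T                                   # complex symmetric (A == A.T), not Hermitian
B = np.block([[A.real, A.imag], [A.imag, -A.real]])

f = rng.standard_normal(m) + 1j * rng.standard_normal(m)
u = np.linalg.solve(A, f)
x = np.linalg.solve(B, np.concatenate([f.real, f.imag]))   # expected: (Re u, -Im u)

eig_B = np.linalg.eigvalsh(B)                 # real, ascending: B is real symmetric
eig_AAbar = np.linalg.eigvalsh(A @ A.conj())  # A conj(A) is Hermitian when A == A.T
```

The eigenvalues of B come in ± pairs whose squares are the eigenvalues of AĀ, so half of them are negative: B is indefinite even though the complex problem is no harder than usual.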


Smoothing Analysis 


Multigrid smoothing methods are usually basic iterative methods, the properties of which are 
well understood. As the name suggests, the function of a multigrid smoothing method is to reduce 
the rough (high frequency) components of the error as efficiently as possible. This is basically a 
local task and so the smoothing efficiency of a method can be analysed by local Fourier mode 
analysis [4], neglecting interactions with boundaries. The smooth (low frequency) error components 
are reduced on the coarser grids. There is a natural distinction between high and low frequencies 
depending on the type of grid coarsening chosen. Essentially, the low frequencies are those which 
are visible on the coarser grids. In principle, smoothing methods need not be convergent (see [5], 
Ch. 7), although in practice most are. 

Consider the discrete analogue of problem (2), Au = f, defined on a mesh of n × n intervals. 
Basic iterative methods are based on a matrix splitting A = M − N and are defined by 

$$M u^{(m+1)}=N u^{(m)}+f.$$

The algebraic error arising from the iterative solution of this system of equations is defined by 
e^{(m)} := u^{(m)} − u and satisfies the equation M e^{(m+1)} = N e^{(m)}. Denoting the stencils of M and N 
by [M] and [N] respectively, this equation can be rewritten in stencil form as [M] e^{(m+1)} = [N] e^{(m)}. 

Now if we define e^{(m+1)} := λ e^{(m)} and note that the algebraic error can be represented as a 
combination of local Fourier modes 

$$e^{(m)}_{j,k}=e^{i(j\theta+k\phi)},\qquad(\theta,\phi)\in\Theta:=\Big\{\Big(\frac{2\pi p}{n},\frac{2\pi q}{n}\Big):-\frac{n}{2}+1\le p,q\le\frac{n}{2}\Big\},$$

then by substituting this into the stencil representation of the error recurrence we define the error 
amplification factor 

$$\lambda(\theta,\phi)=\frac{[N]\,e^{i(j\theta+k\phi)}}{[M]\,e^{i(j\theta+k\phi)}}.$$

The error amplification factor is the factor by which the amplitude of the (θ, φ) Fourier mode is 
multiplied as a result of a single smoothing iteration. In the case of standard grid coarsening, 
the sets of smooth and rough frequencies are defined by 

$$\Theta_s:=\Theta\cap\big(-\tfrac{\pi}{2},\tfrac{\pi}{2}\big)^2,\qquad \Theta_r:=\Theta\setminus\Theta_s.$$
Figure 2: Fourier smoothing factors ρ_D for PGS and KACZ. 


Hence the Fourier smoothing factor, which is the worst factor by which all high-frequency error 
components are reduced per iteration, is defined by 

$$\rho:=\max_{(\theta,\phi)\in\Theta_r}|\lambda(\theta,\phi)|.$$

Note, however, that this definition of the smoothing factor is only valid for boundary conditions of 
harmonic type. The influence of Dirichlet boundary conditions can be taken into account 
heuristically (see [6] and [7], for example) in the following way. The error at the boundary is 
always zero, so modes with θ = 0 or φ = 0 are excluded, and we define a new set of rough 
frequencies as 

$$\Theta_r^D:=\Theta_r\cap\{(\theta,\phi)\in\Theta:\theta\neq 0\ \text{and}\ \phi\neq 0\}.$$

The corresponding Fourier smoothing factor is defined by 

$$\rho_D:=\max_{(\theta,\phi)\in\Theta_r^D}|\lambda(\theta,\phi)|.$$

This is a mesh-dependent definition. A mesh-independent definition, introduced by Brandt [4], is 
obtained by replacing the discrete set Θ with a continuous analogue, but this is more difficult to 
compute numerically and gives less realistic results in cases where the type of boundary condition 
has much influence. 

There are many possibilities for the choice of smoothing method (see [7], for example), but for 
brevity we consider only two: point Gauss-Seidel iteration (PGS) and Kaczmarz iteration (KACZ). 
The latter, dating back to 1937 [8], is considered here because, when applied to the complex linear 
system Au = f with A nonsingular, it converges for all distributions σ(A) of the eigenvalues of A. 
The reason for this is that solving the system Au = f using KACZ is equivalent to solving the 
system AA^H v = f with u = A^H v (i.e. postconditioning) using PGS, and the matrix AA^H is 
Hermitian positive definite, thus guaranteeing convergence. Applying the smoothing analysis to 
stencil (4), the error amplification factors for PGS and KACZ are 

$$\lambda_{PGS}=-\frac{e^{i\theta}+e^{i\phi}}{p+e^{-i\theta}+e^{-i\phi}},$$

$$\lambda_{KACZ}=-\frac{(e^{i\theta}+e^{i\phi})(e^{i\theta}+e^{i\phi}+p+\bar p)+2e^{i(\phi-\theta)}}{4+p\bar p+(e^{-i\theta}+e^{-i\phi})(e^{-i\theta}+e^{-i\phi}+p+\bar p)+2e^{i(\theta-\phi)}},$$

where p = (αh² − 4) + iβh² and (θ, φ) ∈ Θ. Fig. 2 displays contour plots of ρ_D^{PGS} and ρ_D^{KACZ} 
as functions of αh² and βh². For fixed values of h and α = Re δ, as β = Im δ increases, ρ_D^{PGS} 
increases and ρ_D^{KACZ} decreases. Hence we might expect the multigrid convergence rate to 
improve slowly with a KACZ smoother and to deteriorate more rapidly with a PGS smoother as β 
increases. This is borne out in practice. Finally, as a rule of thumb, a good smoothing method has 
a smoothing factor ρ ≤ … . In this sense, neither of the two methods considered here is a good 
smoothing method for problem (2). 
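The smoothing factors plotted in Fig. 2 can be evaluated directly from the two amplification factors. The sketch below codes λ_PGS and λ_KACZ as given above (lexicographic ordering assumed) and maximises |λ| over the Dirichlet rough set Θ_r^D on an n = 64 mesh; the function names and the choice n = 64 are illustrative.

```python
import numpy as np

def lam_pgs(theta, phi, p):
    """Amplification factor of lexicographic point Gauss-Seidel for stencil (4)."""
    a, b = np.exp(1j * theta), np.exp(1j * phi)
    return -(a + b) / (p + np.conj(a) + np.conj(b))

def lam_kacz(theta, phi, p):
    """Amplification factor of Kaczmarz, i.e. PGS on the 13-point stencil of A conj(A)."""
    a, b = np.exp(1j * theta), np.exp(1j * phi)
    s = p + np.conj(p)
    num = (a + b) * (a + b + s) + 2 * np.exp(1j * (phi - theta))
    den = (4 + p * np.conj(p) + (np.conj(a) + np.conj(b)) * (np.conj(a) + np.conj(b) + s)
           + 2 * np.exp(1j * (theta - phi)))
    return -num / den

def rho_dirichlet(lam, p, n=64):
    """rho_D: max |lambda| over rough frequencies with theta != 0 and phi != 0."""
    k = np.arange(-n // 2 + 1, n // 2 + 1)
    theta, phi = np.meshgrid(2 * np.pi * k / n, 2 * np.pi * k / n, indexing="ij")
    smooth = (np.abs(theta) < np.pi / 2) & (np.abs(phi) < np.pi / 2)
    rough_d = ~smooth & (theta != 0) & (phi != 0)
    return np.max(np.abs(lam(theta[rough_d], phi[rough_d], p)))

# p = (alpha h^2 - 4) + i beta h^2; alpha = beta = 0 recovers Laplace's equation.
rho_pgs_laplace = rho_dirichlet(lam_pgs, -4.0 + 0j)
rho_kacz_laplace = rho_dirichlet(lam_kacz, -4.0 + 0j)
```

For Laplace's equation the PGS value comes out just below the familiar 0.5, which is a useful sanity check before sweeping αh² and βh² to reproduce contour plots like those of Fig. 2.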

3 CYCLIC REDUCTION AND MULTIGRID 


Cyclic reduction (CR) is a direct method of solution for tridiagonal and block-tridiagonal 
systems of linear algebraic equations [9], [10]. For tridiagonal systems that represent 
approximations to 1-D second-order ordinary differential equations, CR is as efficient as multigrid 
(MG). For problems in higher dimensions, CR becomes too computationally expensive because of 
fill-in within the blocks. However, the design of MG methods in higher dimensions can be facilitated 
by drawing comparisons between MG and CR (see Shaw [11]). 


Approximate Cyclic Reduction 


Consider the system of equations Lu = f. If v is an approximation to the true solution u, then 
we define the error vector as e := u − v and the residual vector as r := f − Lv = Le. Then, 
assuming that the error vector e is sufficiently smooth (a condition normally guaranteed by a few 
applications of a smoother in an MG algorithm), the fill-in can be minimised by making accurate 
Taylor expansion approximations of the outlying errors. This method is known as approximate 
cyclic reduction (ACR) [12]. 


Now consider a two-grid method applied to a two-dimensional Toeplitz system. Suppose the two 
grids have mesh sizes h and H = 2h and the fine-grid matrix has stencil 

$$L_h\sim\begin{bmatrix} & b & \\ b & a & b\\ & b & \end{bmatrix},$$

where a and b are scalars. Given an initial approximation v to u, we want to solve the equation 
L_h e = r for e to obtain an improved approximation v + e. The method of ACR approaches this 
problem as follows. 

Eliminate the outlying errors in the stencil using neighbouring equations to give 

$$L_h\sim\begin{bmatrix} & & b^2 & & \\ & 2b^2 & 0 & 2b^2 & \\ b^2 & 0 & 4b^2-a^2 & 0 & b^2\\ & 2b^2 & 0 & 2b^2 & \\ & & b^2 & & \end{bmatrix}.$$
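The elimination step can be reproduced symbolically with stencils stored as offset-to-coefficient dictionaries. The sketch assumes the combination b × (sum of the four neighbour equations) − a × (centre equation), which cancels the neighbour unknowns and leaves the 13-point stencil above; the right-hand side is combined with the same weights, b at the neighbours and −a at the centre. The helper names are hypothetical.

```python
from collections import defaultdict

NEIGHBOURS = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def five_point(a, b):
    """Stencil [b; b a b; b] as a dict offset -> coefficient."""
    s = {(0, 0): a}
    s.update({off: b for off in NEIGHBOURS})
    return s

def eliminate_neighbours(a, b):
    """b * (sum of the four neighbour equations) - a * (centre equation)."""
    S = five_point(a, b)
    out = defaultdict(float)
    for nx, ny in NEIGHBOURS:                 # shift the stencil to each neighbour
        for (dx, dy), c in S.items():
            out[(dx + nx, dy + ny)] += b * c
    for off, c in S.items():                  # subtract a times the centre equation
        out[off] -= a * c
    return {off: c for off, c in out.items() if c != 0}

st = eliminate_neighbours(3.0, 2.0)   # centre 4b^2 - a^2 = 7, axis-2 points b^2 = 4, diagonals 2b^2 = 8
```

The neighbour entries vanish exactly, which is why only the centre, the four points at distance 2h and the four diagonal points remain.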
This first step of CR has destroyed the band structure of the original five-point operator. Further 
steps of CR would introduce more fill-in, resulting in a relatively inefficient process. Instead, 
assuming the errors are sufficiently smooth, approximate the errors at the NW, NE, SW and SE 
positions (in compass point notation) using accurate Taylor series expansions. This defines the 
ACR-modified coarse grid matrix, which has stencil 

$$L_H\sim\sigma\begin{bmatrix} & & 2b^2 & & \\ & & 0 & & \\ 2b^2 & 0 & 8b^2-a^2 & 0 & 2b^2\\ & & 0 & & \\ & & 2b^2 & & \end{bmatrix},$$

where σ is an arbitrary scaling parameter. From the above information, the definition of restriction 
from the fine grid to the coarse grid can also be gleaned. The ACR-modified restriction operator 
has stencil 

$$R_h^H\sim\sigma\begin{bmatrix} & b & \\ b & -a & b\\ & b & \end{bmatrix}. \qquad (5)$$

For theoretical considerations it is very convenient to choose restriction and prolongation operators 
that satisfy the relation $P_H^h = (R_h^H)^*$, where $(R_h^H)^*$ is the adjoint operator of $R_h^H$ with 
respect to a suitably defined scalar product. However, the adjoint of the five-point restriction 
operator (5) is not a reasonable prolongation (see [13], p. 78). Alternative definitions of the 
prolongation operator are discussed in the following subsection. 


ACR and the Helmholtz Equation 


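Consider a two-grid method, with mesh sizes h and H = 2h, applied to the fine-grid Helmholtz 
differential operator 𝓛^h u := ∇²u + δu. Using a central difference discretisation on a mesh of n × n 
intervals, the fine-grid matrix has stencil 

$$L_h\sim\frac{1}{h^2}\begin{bmatrix} & 1 & \\ 1 & p & 1\\ & 1 & \end{bmatrix},$$

where h := 1/n and p := δh² − 4. Hence a = p/h² and b = 1/h². Now if we choose σ = h²/8, then the 
ACR-modified coarse grid matrix and restriction operator have stencils 

$$L_H\sim\frac{1}{(2h)^2}\begin{bmatrix} & & 1 & & \\ & & 0 & & \\ 1 & 0 & 4-\tfrac{p^2}{2} & 0 & 1\\ & & 0 & & \\ & & 1 & & \end{bmatrix},\qquad R_h^H\sim\frac{1}{8}\begin{bmatrix} & 1 & \\ 1 & -p & 1\\ & 1 & \end{bmatrix}.$$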
Therefore the analogous coarse-grid Helmholtz differential operator is defined as 
𝓛^H u := ∇²u + δ(1 − δh²/8)u, i.e. ACR suggests solving the Helmholtz equation with a different 
value of δ on the coarse grid in order to stabilise the MG process. For positive real values of δ for 
which L_h is indefinite, this corresponds to solving the Helmholtz equation with a smaller value of δ 
on the coarse grid, thus reducing the indefiniteness of L_H. There are various ways to define the 
prolongation operator; possibilities include seven-point and nine-point prolongation [14]. However, 
a more effective definition of the prolongation operator for this interface problem is 

$$P_H^h\sim\begin{bmatrix} 4 & -4p & 4\\ -4p & 3p^2 & -4p\\ 4 & -4p & 4\end{bmatrix},$$

which is derived from the tensor product of the one-dimensional ACR-modified prolongation 
operator. To extend these ideas to an m-grid process, where h_i is the mesh size of grid Ω_i and 
h_{i+1} = 2h_i, we proceed as follows. 



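Define δ₁ := δ and δ_k := δ_{k−1}(1 − δ_{k−1}h_{k−1}²/8) =: δ_{k−1}c_k (2 ≤ k ≤ m), and p_k := δ_k h_k² − 4. Then the 
differential operator on grid Ω_k is defined as 

$$\mathcal{L}_k u:=\nabla^2 u+\delta_k u,$$

for 1 ≤ k ≤ m, provided σ = h_{k−1}²/8 is chosen at each coarsening step. Therefore, the ACR-modified 
definitions of the matrix of the discrete system on grid Ω_k and of the restriction and prolongation 
operators have stencils 

$$L_k\sim\frac{1}{h_k^2}\begin{bmatrix} & 1 & \\ 1 & p_k & 1\\ & 1 & \end{bmatrix},\qquad R_k^{k+1}\sim\frac{1}{8}\begin{bmatrix} & 1 & \\ 1 & -p_k & 1\\ & 1 & \end{bmatrix},\qquad P_{k+1}^{k}\sim\begin{bmatrix} 4 & -4p_k & 4\\ -4p_k & 3p_k^2 & -4p_k\\ 4 & -4p_k & 4\end{bmatrix},$$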

respectively. We call this ACR-modified multigrid process CR-MG. Note that the CR-MG 
restriction operator is similar to the operator naturally suggested by the principle of total reduction 
(see [15] and [16], for example). Further, for Laplace's equation (i.e. δ = 0), p_k = −4 and the 
CR-MG restriction operator corresponds to half weighting. 
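The level-by-level parameters of CR-MG follow from a short recursion. The sketch below computes δ_k, h_k and p_k from the definition above and can be used to check that the coarse-grid centre entry 4 − p_k²/2 of the two-grid ACR stencil equals p_{k+1}, so that each coarse matrix again has the Helmholtz form. The function name and the sample values (δ = 30 + 10i, h₁ = 1/64, four grids) are illustrative.

```python
def coarse_deltas(delta, h1, m):
    """delta_k, h_k and p_k for grids k = 1..m with h_{k+1} = 2 h_k.

    delta_k = delta_{k-1} (1 - delta_{k-1} h_{k-1}^2 / 8) and p_k = delta_k h_k^2 - 4.
    """
    deltas, hs = [delta], [h1]
    for _ in range(1, m):
        d, h = deltas[-1], hs[-1]
        deltas.append(d * (1 - d * h**2 / 8))
        hs.append(2 * h)
    ps = [d * h**2 - 4 for d, h in zip(deltas, hs)]
    return deltas, hs, ps

deltas, hs, ps = coarse_deltas(30 + 10j, 1 / 64, 4)
```

For δ = 0 every p_k equals −4, matching the half-weighting remark above, and for small positive real δ the coarse-grid δ_k decreases from level to level.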


4 NUMERICAL RESULTS 


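Consider the complex two-dimensional Dirichlet boundary value problem 

$$\nabla^2 u+\delta u=0\ \ \text{in}\ \Omega=\Omega_1\cup\Omega_2\ \text{(unit square)},\qquad u=g\ \ \text{on}\ \partial\Omega,$$

with data 

$$\delta=\begin{cases}30+10i & \text{in } \Omega_2:\ \ldots<x,y<\ldots,\\ 1 & \text{in } \Omega_1:=\Omega\setminus\Omega_2,\end{cases}\qquad g=\begin{cases}\sin(\ldots) & \text{on } x=0,\ \ldots<y<\ldots,\\ 0 & \text{elsewhere on } \partial\Omega.\end{cases}$$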


For convenience, we consider a domain Ω consisting of two concentric squares. The value of δ in Ω₂ 
is a typical value calculated from the data in [1]. In the following experiment we assess the 
efficiency of the CR-MG algorithm, as described in the preceding section, and compare it with 
standard MG using full weighting restriction and nine-point prolongation. 


The problem is discretised piecewise according to (3) and (4), using central differences on a 
65 × 65 grid. A four-grid method is employed, with standard grid coarsening. This ensures good 
resolution of the inner subdomain Ω₂ on the coarsest grid. The multigrid schedule used is the 
V-cycle with two pre-smoothing and two post-smoothing iterations, and LU decomposition with 
partial pivoting is used to solve the defect equation exactly on the coarsest grid. The initial 
estimate is taken to be the zero vector and convergence is measured by log₁₀‖r‖₂, where r is the 
residual vector and ‖·‖₂ is the usual Euclidean norm. 


With the convergence tolerance set to 

$$\log_{10}\|r\|_2<-9,$$

the convergence times of MG and CR-MG with PGS and KACZ smoothers were measured; the 
results are displayed in Table 1. All convergence times were measured in seconds on a Sun SPARC 
workstation. 

Table 1: CPU Convergence Times 

  CPU time (s)    PGS      KACZ 
  MG              22.8     191.6 
  CR-MG           18.5     155.9 

We immediately notice that both MG and CR-MG converge much more rapidly with a PGS 
smoother than with a KACZ smoother. This is not unexpected, considering the smoothing 
properties of these two iterative methods. Further, KACZ is a more computationally intensive 
smoother than PGS, having a 13-point stencil compared with the 5-point stencil of PGS. 

Most importantly, however, we find that with both smoothers CR-MG converges significantly 
faster than MG. In fact, with both smoothers CR-MG provides a saving of about 19 percent in CPU 
time over MG. This is a significant saving, especially for larger problems. The rates of convergence 
of MG and CR-MG with a PGS smoother are compared graphically in Fig. 3. Both plots are 
approximately straight lines, reflecting a constant convergence factor per cycle, as expected from 
the grid-independent convergence of the multigrid method. 
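The quoted saving can be checked directly from Table 1; the short computation below uses only the four timings in the table.

```python
# (MG, CR-MG) CPU times in seconds, copied from Table 1.
times = {"PGS": (22.8, 18.5), "KACZ": (191.6, 155.9)}

# Percentage of the MG time saved by CR-MG for each smoother.
savings = {smoother: 100.0 * (mg - cr) / mg for smoother, (mg, cr) in times.items()}
```

The savings are about 18.9 percent with PGS and 18.6 percent with KACZ, both of which round to the 19 percent stated in the text.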
Figure 3: Convergence of MG and CR-MG with a PGS smoother. 


5 CONCLUDING REMARKS 


In this paper, attention has been focussed on improving the design of the standard multigrid 
method with respect to a particular problem, namely the complex-valued microwave oven problem. 
By drawing a comparison with the direct method of cyclic reduction, improved discretisation,
restriction and prolongation operators have been designed, resulting in savings of up to 19 percent 
in CPU time used. 

Only two smoothing methods have been considered here, point Gauss-Seidel and Kaczmarz.
However, there are many more robust smoothers, such as alternating damped Jacobi, alternating
symmetric line Gauss-Seidel and incomplete LU decomposition. These methods, and many more,
have been summarised and analysed in detail in [7]. Improvements in the convergence properties of 
the modified multigrid method (CR-MG) will almost certainly be realised by using such smoothers. 

Finally, attention in this paper has been restricted to the microwave oven problem, although the 
ideas presented here can be extended to other problems. For example, in [11], these ideas were 
applied to the convection-diffusion equation and it was shown that approximate cyclic reduction 
can be used to define the ideal quantity of coarse grid artificial viscosity and the direction in which
it lies.
ABSTRACT


In this paper, we present an analysis of a multigrid method for nonsymmetric and/or indefinite elliptic 
problems. In this multigrid method various types of smoothers may be used. One type of smoother which we 
consider is defined in terms of an associated symmetric problem and includes point and line, Jacobi and Gauss-Seidel
iterations. We also study smoothers based entirely on the original operator. One is based on the normal
form, that is, the product of the operator and its transpose. Other smoothers studied include point and line, 
Jacobi and Gauss-Seidel. We show that the uniform estimates of (ref. 6) for symmetric positive definite problems 
carry over to these algorithms. More precisely, the multigrid iteration for the nonsymmetric and/or indefinite 
problem is shown to converge at a uniform rate provided that the coarsest grid in the multilevel iteration is 
sufficiently fine (but not depending on the number of multigrid levels). 


1. INTRODUCTION 


The purpose of this paper is to study certain multigrid methods for second order elliptic boundary value 
problems including problems which may be nonsymmetric and/or indefinite. Multigrid methods are among 
the most efficient methods available for solving the discrete equations associated with approximate solutions of 
elliptic partial differential equations. Since their introduction by Fedorenko (ref. 15), there has been intensive 
research toward the mathematical understanding of such methods. The reader is referred to (ref. 19), (ref. 17) and 
(ref. 3) and the bibliographies therein. Most of these works concern symmetric, positive definite elliptic problems 
although a few consider nonsymmetric and/or indefinite problems. In particular, (ref. 1), (ref. 18), (ref. 10) and
(ref. 24) deal with such multigrid algorithms and are most closely related to the subject of this paper. All of these 
papers share the requirement that the coarse grid be sufficiently fine. We shall briefly describe their contents. 


*This manuscript has been authored under contract number DE-AC02-76CH00016 with the U.S. Department of Energy. Accordingly, the
U.S. Government retains a non-exclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to
do so, for U.S. Government purposes. This work was also supported in part under the National Science Foundation Grant No. DMS-9007185
and by the U.S. Army Research Office through the Mathematical Sciences Institute, Cornell University. The second author was also partially
supported by the Korea Science and Engineering Foundation.
The paper by Bank (ref. 1) derives uniform convergence estimates for the W-cycle multigrid iteration with
both a standard Jacobi smoother and a smoother which uses the operator times its adjoint. In each case, a
sufficiently large number of smoothing steps is required, and the coarse grid must be sufficiently fine, with the
required fineness depending on the number of smoothings. Some regularity of the elliptic partial differential equation was also required.

Mandel studied the V-cycle iteration and showed that it was effective with only one smoothing and a 
sufficiently fine coarse grid. His result requires that the underlying partial differential equation satisfy the “full
elliptic regularity” hypothesis and generalizes the results of Braess and Hackbusch (ref. 2) for the symmetric
positive definite problem.

Bramble, Pasciak and Xu (ref. 10) studied the symmetric smoother introduced by Bank and showed that 
the W-cycle and variable V-cycle worked without making the undesirable requirement of “sufficiently many 
smoothings”. Somewhat more than minimal regularity was needed.

In (ref. 24), Wang showed that, for the standard V-cycle with one smoothing, the “reduction factor” for the
iteration error was bounded by $1 - C/J + C_1 h_1$, where $J$ is the number of levels, $h_1$ is the size of the coarsest grid,
and $C$ and $C_1$ are constants. This estimate deteriorates with the number of levels and will be less than one only if
the coarse grid is made finer as the number of levels increases. Minimal elliptic regularity was assumed.

In this paper uniform iterative convergence estimates for V-cycle multigrid methods applied to nonsymmetric
and/or indefinite problems are proved under rather weak assumptions (e.g., the domain need not be convex).
Uniform estimates were shown to hold in (ref. 6) and (ref. 8) for the V-cycle with one smoothing step in
the symmetric positive definite case under such hypotheses. We show that these results carry over to the
nonsymmetric and/or indefinite case for a variety of smoothers. The coarse grid must be fine enough, but its size need
not depend on the number of levels $J$. Such a condition seems unavoidable since, in many cases, it is needed even
for the approximate problem to make sense.

In recent years, some other techniques have been proposed to handle the nonsymmetric indefinite case. One
approach in (ref. 14), (ref. 4) and (ref. 7) is to precondition with a symmetric operator and then solve certain
normal equations by the conjugate gradient method. One possible advantage of such a method is that some
nonsymmetric problems which are not “compact perturbations” of symmetric ones may be treated. Of course, the
usual normal equations may be formed and then preconditioned (cf. (ref. 7) and (ref. 20)); this approach seems
to be rather restrictive in that good preconditioners may be difficult to construct. Other recent approaches have
included Schwarz type methods (ref. 12) and two-level methods in which a “coarse space” is introduced to reduce
the problem to one with a positive definite symmetric part (cf. (ref. 4), (ref. 13) and (ref. 25)).

The remainder of the paper is organized as follows: In Section 2, we describe a model problem and introduce
the multigrid method. In Section 3, smoothers based on the symmetric problem (and used in our nonsymmetric
and/or indefinite applications) are defined and the relevant properties which they satisfy are stated. Section
4 develops smoothers based on the original problem. The main results of the paper, which provide iterative 
convergence rates for the multigrid algorithms with the smoothers of Sections 3 and 4, are given in Section 5. 

2. THE PROBLEM AND MULTIGRID ALGORITHM. 

We set up the model nonsymmetric problem and the simplest multigrid algorithm in this section. We consider, 
for simplicity, the Dirichlet problem in two spatial dimensions approximated by piecewise linear finite elements 
on a quasi-uniform mesh. The multigrid convergence results hold for many extensions and generalizations as 
discussed at the end of Section 5. 


We consider as our model problem the following second order elliptic equation with homogeneous boundary
conditions:

$$-\sum_{i,j=1}^{2}\frac{\partial}{\partial x_i}\Big(a_{ij}\frac{\partial u}{\partial x_j}\Big)+\sum_{i=1}^{2}b_i\frac{\partial u}{\partial x_i}+a\,u=f\quad\text{in }\Omega,\qquad u=0\quad\text{on }\partial\Omega, \tag{2.1}$$

where $\Omega$ is a polygonal domain (possibly nonconvex) in $\mathbb{R}^2$ and $\{a_{ij}(x)\}$ is bounded, symmetric and uniformly
positive definite for $x \in \Omega$. We assume that $a_{ij}$ is in the Sobolev space $W^1_p(\Omega)$ for $p > 2$ (see (ref. 16) for
the definition of $W^1_p(\Omega)$). Further, we assume that $b_i$ is continuously differentiable on $\Omega$ and that $|a|$ is bounded.
Finally, we assume that the solution of (2.1) exists.

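Let $H^1(\Omega)$ denote the Sobolev space of order one on $\Omega$ (cf. (ref. 16)) and let $H^1_0(\Omega)$ denote those functions in
$H^1(\Omega)$ whose trace vanishes on $\partial\Omega$. For $v, w \in H^1_0(\Omega)$, define

$$A(v,w)=\sum_{i,j=1}^{2}\int_\Omega a_{ij}\frac{\partial v}{\partial x_j}\frac{\partial w}{\partial x_i}\,dx+\sum_{i=1}^{2}\int_\Omega b_i\frac{\partial v}{\partial x_i}\,w\,dx+\int_\Omega a\,v\,w\,dx. \tag{2.2}$$

The solution $u$ of (2.1) satisfies

$$A(u,v)=(f,v)\quad\text{for all } v\in H^1_0(\Omega), \tag{2.3}$$

where $(\cdot,\cdot)$ denotes the inner product in $L^2(\Omega)$.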


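For the analysis, we introduce a symmetric positive definite form $\bar A(\cdot,\cdot)$ which has the same second order part
as $A(\cdot,\cdot)$. We define $\bar A(\cdot,\cdot)$ by

$$\bar A(v,w)=\sum_{i,j=1}^{2}\int_\Omega a_{ij}\frac{\partial v}{\partial x_j}\frac{\partial w}{\partial x_i}\,dx.$$

The difference is denoted by

$$D(u,v)=A(u,v)-\bar A(u,v).$$

The form $D(\cdot,\cdot)$ satisfies the inequalities

$$|D(u,v)|\le C\|u\|_1\|v\|\quad\text{and}\quad|D(u,v)|\le C\|u\|\,\|v\|_1. \tag{2.4}$$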

Here $\|\cdot\|_1$ and $\|\cdot\|$ denote the norms in $H^1(\Omega)$ and $L^2(\Omega)$, respectively. The second inequality above follows
from integration by parts. Here and throughout the paper, c or C, with or without subscript, will denote a
generic positive constant. These constants can take on different values in different occurrences but will always be
independent of the mesh size and the number of levels in multigrid algorithms.


By the assumptions on the coefficients appearing in the definition of $\bar A(\cdot,\cdot)$, it follows that the norm $\bar A(v,v)^{1/2}$
for $v \in H^1_0(\Omega)$ is equivalent to the norm on $H^1(\Omega)$. Thus, we take

$$\|v\|_1=\bar A(v,v)^{1/2}.$$


We develop a sequence of nested triangulations of $\Omega$ in the usual way. We assume that a coarse triangulation
$\{\tau_1\}$ of $\Omega$ is given. Successively finer triangulations $\{\tau_m\}$ for $m > 1$ are defined by subdividing each triangle
(in a coarser triangulation) into four by connecting the midpoints of the edges. The mesh size of $\{\tau_1\}$ will be
denoted by $d_1$ and can be taken to be the diameter of the largest triangle. By similarity, the mesh size of $\{\tau_m\}$
is $2^{1-m}d_1$.


For theoretical and practical purposes, the coarsest grid in the multilevel algorithms must be sufficiently fine.
In practice, however, the coarse grid is still considerably coarser than the solution grid. Let $L$ and $J$ be greater
than or equal to one and, for $k = 1,\ldots,J$, let $M_k$ be the space of functions which are piecewise linear with respect to
the triangulation $\{\tau_{k+L}\}$, continuous on $\bar\Omega$ and vanish on $\partial\Omega$. Since the triangulations are nested, it follows that

$$M_1\subset M_2\subset\cdots\subset M_J.$$

The space $M_k$ has a mesh size of $h_k = 2^{1-L-k}d_1 = 2^{1-k}h_1$.
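The mesh sizes of $M_1,\ldots,M_J$ can be tabulated as shown below. The values of $d_1$, $L$ and $J$ are arbitrary choices for the example. The sketch confirms that $h_k = 2^{1-L-k}d_1$ is the same as $2^{1-k}h_1$.

```python
def mesh_sizes(d1, L, J):
    """h_k = 2**(1 - L - k) * d1 for k = 1..J: M_k lives on triangulation
    tau_{k+L}, whose mesh size is 2**(1 - m) * d1 with m = k + L."""
    return [2.0 ** (1 - L - k) * d1 for k in range(1, J + 1)]

h = mesh_sizes(d1=1.0, L=3, J=5)
print(h)   # [0.125, 0.0625, 0.03125, 0.015625, 0.0078125]
```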

Fix $k$ in $\{1,2,\ldots\}$. Let us temporarily assume that for every $u \in M_k$,

$$A(u,v)=0\ \text{for all } v\in M_k\quad\text{implies}\quad u=0. \tag{2.5}$$

This assumption immediately implies the existence and uniqueness of solutions to problems of the form: Given a
linear functional $F(\cdot)$ defined on $M_k$, find $u \in M_k$ satisfying

$$A(u,\phi)=F(\phi)\quad\text{for all } \phi\in M_k.$$

In particular, the projection operator $P_k : H^1_0(\Omega) \to M_k$ satisfying

$$A(P_ku,v)=A(u,v)\quad\text{for all } v\in M_k$$

is well defined.

Clearly, if (2.2) has a positive definite symmetric part then (2.5) holds. More generally, if solutions of (2.1)
satisfy regularity estimates of the form

$$\|u\|_{1+\alpha}\le C\|f\|_{-1+\alpha}, \tag{2.6}$$

then it is well known (cf. (ref. 22)) that there exists a constant $h_0$ such that for $h_k < h_0$, (2.5) holds and,
furthermore,

$$\|(I-P_k)u\|\le c\,h_k^{\alpha}\,\|(I-P_k)u\|_1, \tag{2.7}$$

and, finally,

$$\|P_ku\|_1\le C\|u\|_1. \tag{2.8}$$

Even if regularity estimates of the form (2.6) are not known to hold, (2.5) still follows from a recent result of
Schatz and Wang (ref. 23).

Lemma 2.1 (ref. 23). There exists an $h_0$ such that (2.5) holds for $h_k < h_0$. Moreover, given $\epsilon > 0$, there exists
an $h_0(\epsilon) > 0$ such that for all $h_k \in (0, h_0(\epsilon)]$, (2.8) holds and

$$\|(I-P_k)u\|\le\epsilon\,\|(I-P_k)u\|_1. \tag{2.9}$$


Remark 2.1. The above $\epsilon$ will appear in our subsequent analysis. We note that $\epsilon$ can be taken arbitrarily small.
However, $L$ will be taken large enough so that (2.5), (2.8) and (2.9) hold. Thus, the coarse grid size (i.e., $L$) for
any estimate in which $\epsilon$ appears will depend on $\epsilon$.


In our analysis, we shall use the orthogonal projectors $\bar P_k : H^1_0(\Omega) \to M_k$ and $Q_k : L^2(\Omega) \to M_k$ which,
respectively, denote the elliptic projection corresponding to $\bar A(\cdot,\cdot)$ and the $L^2(\Omega)$ projection. These are defined by

$$\bar A(\bar P_ku,v)=\bar A(u,v)\quad\text{for all } v\in M_k,$$

and

$$(Q_ku,v)=(u,v)\quad\text{for all } v\in M_k.$$


The multigrid algorithms will be defined in terms of an additional inner product $(\cdot,\cdot)_k$ on $M_k \times M_k$. Examples of
this inner product in our applications will be given in the next section. Additional operators are defined in terms
of this inner product as follows: For each $k$, define $A_k : M_k \to M_k$ and $\bar A_k : M_k \to M_k$ by

$$(A_ku,v)_k=A(u,v)\quad\text{for all } v\in M_k,$$

and

$$(\bar A_ku,v)_k=\bar A(u,v)\quad\text{for all } v\in M_k.$$

Finally, the restriction operator $P^0_{k-1} : M_k \to M_{k-1}$ is defined by

$$(P^0_{k-1}u,v)_{k-1}=(u,v)_k\quad\text{for all } v\in M_{k-1}.$$

We seek the solution of

$$A(u,v)=(f,v)\quad\text{for all } v\in M_J. \tag{2.10}$$

This can be rewritten in the above notation as

$$A_Ju=Q_Jf. \tag{2.11}$$


We describe the simplest V-cycle multigrid algorithm for iteratively computing the solution $u$ of (2.11). Given
an initial iterate $u_0 \in M_J$, we define a sequence approximating $u$ by

$$u_{i+1}=Mg_J(u_i,Q_Jf). \tag{2.12}$$

Here $Mg_J(\cdot,\cdot)$ is a map of $M_J \times M_J$ into $M_J$ and is defined as follows.

Definition MG. Set $Mg_1(v,w) = A_1^{-1}w$. Let $k > 1$ and let $v, w$ be in $M_k$. Assuming that $Mg_{k-1}(\cdot,\cdot)$ has been
defined, we define $Mg_k(v,w)$ by:

(1) $x^k = v + R_k(w - A_kv)$.

(2) $Mg_k(v,w) = x^k + q$, where $q$ is defined by

$$q=Mg_{k-1}\big(0,\,P^0_{k-1}(w-A_kx^k)\big).$$

Here $R_k : M_k \to M_k$ is a linear smoothing operator. Note that in this V-cycle, we smooth only as we proceed
to coarser grids.
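The following is a minimal algebraic sketch of Definition MG, not the authors' code. It makes several assumptions:

- a hypothetical 1-D model (centred differences for $-u'' + bu'$ on $(0,1)$);
- Galerkin coarse operators $A_{k-1} = I_k^T A_k I_k$ built with linear interpolation $I_k$;
- the Euclidean inner product on nodal coefficients in place of (3.4), so that the restriction $P^0_{k-1}$ becomes $I_k^T$;
- an Example-1-type Richardson smoother $R_k = \lambda_k^{-1}I$, with $\lambda_k$ taken as the largest eigenvalue of the symmetric part of $A_k$.

```python
import numpy as np

def interpolation(nc):
    """Linear interpolation I_k from nc to 2*nc + 1 interior nodes (1-D, zero boundary values)."""
    P = np.zeros((2 * nc + 1, nc))
    for j in range(nc):
        P[2 * j:2 * j + 3, j] = [0.5, 1.0, 0.5]   # coarse node j sits at fine index 2j+1
    return P

def model_operator(n, b):
    """Hypothetical fine-grid matrix: centred differences for -u'' + b u' on (0, 1)."""
    h = 1.0 / (n + 1)
    off = np.ones(n - 1)
    A = (2.0 * np.eye(n) - np.diag(off, 1) - np.diag(off, -1)) / h**2
    return A + b / (2.0 * h) * (np.diag(off, 1) - np.diag(off, -1))

def hierarchy(n1, J, b):
    """Galerkin operators A_k (A_{k-1} = I_k^T A_k I_k) and interpolations I_k."""
    sizes = [(n1 + 1) * 2**k - 1 for k in range(J)]   # sizes[k-1] = dim M_k
    A = {J: model_operator(sizes[-1], b)}
    I = {}
    for k in range(J, 1, -1):
        I[k] = interpolation(sizes[k - 2])
        A[k - 1] = I[k].T @ A[k] @ I[k]
    return A, I

def richardson(Ak):
    """R_k = I / lambda_k, lambda_k = largest eigenvalue of the symmetric part of A_k."""
    lam = np.linalg.eigvalsh(0.5 * (Ak + Ak.T)).max()
    return np.eye(len(Ak)) / lam

def mg(k, v, w, A, I, R):
    """Mg_k(v, w) of Definition MG; with Euclidean (.,.)_k the restriction P^0_{k-1} is I_k^T."""
    if k == 1:
        return np.linalg.solve(A[1], w)                # Mg_1(v, w) = A_1^{-1} w
    x = v + R[k] @ (w - A[k] @ v)                      # step (1): one smoothing step
    rhs = I[k].T @ (w - A[k] @ x)                      # P^0_{k-1}(w - A_k x^k)
    q = mg(k - 1, np.zeros(len(A[k - 1])), rhs, A, I, R)
    return x + I[k] @ q                                # step (2): x^k + q, q viewed in M_k

# Usage: iterate (2.12), with the vector f standing in for Q_J f.
J, n1, b = 5, 3, 1.0
A, I = hierarchy(n1, J, b)
R = {k: richardson(A[k]) for k in range(2, J + 1)}
rng = np.random.default_rng(0)
u_exact = rng.standard_normal(len(A[J]))
f = A[J] @ u_exact
u = np.zeros_like(f)
for _ in range(60):
    u = mg(J, u, f, A, I, R)
rel_err = np.linalg.norm(u - u_exact) / np.linalg.norm(u_exact)
print(f"relative error after 60 V-cycles: {rel_err:.2e}")
```

Only pre-smoothing is performed, matching the remark above. The coarsest problem is solved exactly, as in $Mg_1(v,w) = A_1^{-1}w$.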

In Section 3, we define $R_k$ in terms of smoothing operators defined for the form $\bar A(\cdot,\cdot)$. Specifically, the
smoothing procedure for the symmetric problem will be denoted $\bar R_k : M_k \to M_k$ and we set $R_k = \bar R_k$. In Section 4,
we consider smoothers which are directly defined in terms of the original operator $A_k$.

A straightforward mathematical induction argument shows that $Mg_J(\cdot,\cdot)$ is a linear map from $M_J \times M_J$ into $M_J$.
Moreover, the scheme is consistent in the sense that $v = Mg_J(v, A_Jv)$ for all $v \in M_J$. It easily follows that the
linear operator $E = Mg_J(\cdot,0)$ is the error reduction operator for (2.12), that is,

$$u-u_{i+1}=E(u-u_i).$$
Let $T_k = R_kA_kP_k$ for $k > 1$ and set $T_1 = P_1$. Using the facts that $P^0_{k-1}A_k = A_{k-1}P_{k-1}$ and $P_{k-1}P_k = P_{k-1}$ and
Definition MG, a straightforward manipulation gives that for $k > 1$ and any $u \in M_J$,

$$u-Mg_k(0,A_kP_ku)=(I-T_k)u-Mg_{k-1}\big(0,A_{k-1}P_{k-1}(I-T_k)u\big).$$

Let $E_ku = u - Mg_k(0, A_kP_ku)$. In terms of $E_k$, the above identity is the same as

$$E_k=E_{k-1}(I-T_k).$$

Moreover, by consistency, $E = E_J$ and hence

$$E=(I-T_1)(I-T_2)\cdots(I-T_J). \tag{2.13}$$

The product representation of the error operator given above will be a fundamental ingredient in the convergence
analysis presented in Section 5. Similar representations in the case of multigrid algorithms for symmetric problems
were given in (ref. 9).
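The product representation (2.13) can be checked numerically in a purely algebraic setting. The sketch below uses hypothetical data: a 1-D Laplacian with a small random nonsymmetric perturbation, Galerkin coarse operators, damped Jacobi smoothing and the Euclidean $(\cdot,\cdot)_k$. It assembles $E = Mg_J(\cdot,0)$ column by column and compares it with $(I-T_1)\cdots(I-T_J)$. Here $\hat I_k$ denotes interpolation from level $k$ to level $J$ (a name introduced only for this sketch). The operators are $T_k = \hat I_k R_k \hat I_k^T A_J$ for $k>1$, and $T_1 = \hat I_1 A_1^{-1}\hat I_1^T A_J$ is the matrix of $P_1$.

```python
import numpy as np

def interpolation(nc):
    """Linear interpolation from nc to 2*nc + 1 interior nodes (1-D)."""
    P = np.zeros((2 * nc + 1, nc))
    for j in range(nc):
        P[2 * j:2 * j + 3, j] = [0.5, 1.0, 0.5]
    return P

rng = np.random.default_rng(3)
J, n1 = 4, 3
n = [(n1 + 1) * 2**k - 1 for k in range(J)]                  # n[k-1] = dim M_k
nJ = n[-1]
# Hypothetical nonsymmetric fine matrix: scaled 1-D Laplacian plus a small random part.
AJ = (2 * np.eye(nJ) - np.eye(nJ, k=1) - np.eye(nJ, k=-1)) * (nJ + 1) ** 2
AJ += 0.5 * rng.standard_normal(AJ.shape)

I = {k: interpolation(n[k - 2]) for k in range(2, J + 1)}    # I_k : level k-1 -> k
Ihat = {J: np.eye(nJ)}                                        # Ihat_k : level k -> J
for k in range(J, 1, -1):
    Ihat[k - 1] = Ihat[k] @ I[k]
A = {k: Ihat[k].T @ AJ @ Ihat[k] for k in range(1, J + 1)}    # Galerkin A_k
R = {k: 0.5 * np.diag(1.0 / np.diag(A[k])) for k in range(2, J + 1)}  # damped Jacobi

def mg(k, v, w):
    """Mg_k(v, w) of Definition MG (Euclidean (.,.)_k, so P^0_{k-1} = I_k^T)."""
    if k == 1:
        return np.linalg.solve(A[1], w)
    x = v + R[k] @ (w - A[k] @ v)
    return x + I[k] @ mg(k - 1, np.zeros(n[k - 2]), I[k].T @ (w - A[k] @ x))

# Error operator E = Mg_J(., 0), assembled column by column.
E_mg = np.column_stack([mg(J, e, np.zeros(nJ)) for e in np.eye(nJ)])

# Product form (2.13).
Id = np.eye(nJ)
T = {1: Ihat[1] @ np.linalg.solve(A[1], Ihat[1].T @ AJ)}
for k in range(2, J + 1):
    T[k] = Ihat[k] @ R[k] @ Ihat[k].T @ AJ
E_prod = Id.copy()
for k in range(1, J + 1):
    E_prod = E_prod @ (Id - T[k])

print("max |E_mg - E_prod| =", np.max(np.abs(E_mg - E_prod)))
```

The factor $(I - T_J)$ acts first, which corresponds to smoothing on the finest level before the coarse corrections.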

The above algorithm is a special case of more general multigrid algorithms in that we only use pre-smoothing. 
Alternatively, we could define an algorithm with just post-smoothing or both pre- and post-smoothing. The 
analysis of these algorithms is similar to that above and will not be presented. 

Often algorithms with more than one smoothing are considered (ref. 3), (ref. 17), (ref. 19). This is not advised 
in the above algorithm since the smoothing iteration is generally unstable. 


3. SMOOTHERS BASED ON THE SYMMETRIC PROBLEM. 

In this section, we consider smoothers which are based on the symmetric problem. The symmetric smoother
will be denoted by $\bar R_k$. We state a number of abstract conditions concerning these smoothing operators. We
then give three examples of smoothing procedures which satisfy these assumptions. In Section 5, we provide
convergence estimates for multigrid algorithms with $R_k = \bar R_k$ in Definition MG.

The first two conditions are standard assumptions used in earlier multigrid analyses. For $k > 1$, let
$K_k = I - \bar R_k\bar A_k$ (defined on $M_k$) and $\bar T_k = \bar R_k\bar A_k\bar P_k$ (defined on $M_J$). We assume that:

(1) There is a constant $C_R$ such that

$$\frac{(u,u)_k}{\lambda_k}\le C_R(\tilde R_ku,u)_k\quad\text{for all } u\in M_k, \tag{C.1}$$

where $\tilde R_k = (I - K_k^*K_k)\bar A_k^{-1}$ and $\lambda_k$ is the largest eigenvalue of $\bar A_k$. Here and in the remainder of this paper,
$^*$ denotes the adjoint with respect to the inner product $\bar A(\cdot,\cdot)$.

(2) There is a constant $\theta < 2$ not depending on $k$ satisfying

$$\bar A(\bar T_kv,\bar T_kv)\le\theta\,\bar A(\bar T_kv,v)\quad\text{for all } v\in M_k. \tag{C.2}$$

Provided that (C.2) holds, (C.1) is equivalent to

$$\frac{(u,u)_k}{\lambda_k}\le C(\bar R_ku,u)_k\quad\text{for all } u\in M_k. \tag{3.1}$$

When $\bar R_k$ is symmetric with respect to $(\cdot,\cdot)_k$, (C.2) states that the norm of $\bar T_k$ is less than or equal to $\theta$. Even in
the case of nonsymmetric $\bar R_k$, (C.2) implies stability of $(I - \bar T_k)$. In fact, for any $w \in M_J$, (C.2) implies that

$$\begin{aligned}\bar A((I-\bar T_k)w,(I-\bar T_k)w)&=\bar A(w,w)-2\bar A(\bar T_kw,w)+\bar A(\bar T_kw,\bar T_kw)\\&\le\bar A(w,w)-(2-\theta)\bar A(\bar T_kw,w)\le\bar A(w,w).\end{aligned}\tag{3.2}$$
The final condition is that for $k > 1$, there exists a constant $C$ satisfying

$$(\bar T_ku,\bar T_ku)_k\le C\lambda_k^{-1}\bar A(\bar T_ku,u)\quad\text{for all } u\in M_k. \tag{C.3}$$

A simple change of variable shows that (C.3) is the same as

$$(\bar R_kv,\bar R_kv)_k\le C\lambda_k^{-1}(\bar R_kv,v)_k\quad\text{for all } v\in M_k.$$

In the case when $\bar R_k$ is symmetric, this is equivalent to

$$(\bar R_kv,v)_k\le C\lambda_k^{-1}(v,v)_k\quad\text{for all } v\in M_k \tag{3.3}$$

and is the opposite inequality of (3.1). Note that both (C.2) and (C.3) hold on $M_J$.

Remark 3.1. If Conditions (C.1)-(C.3) hold for a smoother $\bar R_k$, then they hold for its adjoint $\bar R_k^t$ with respect to
the inner product $(\cdot,\cdot)_k$. This means that (C.1) holds for $\tilde R_k = (I - K_kK_k^*)\bar A_k^{-1}$ and that (C.2) and (C.3) hold
with $\bar T_k^*$ replacing $\bar T_k$. In the case of (C.2) and (C.3), the corresponding inequalities hold with the same constants
as those appearing in the original inequalities.


Example 1. The first example of a smoother is the operator

$$\bar R_k=\bar\lambda_k^{-1}I,$$

where $I$ denotes the identity operator on $M_k$ and $\lambda_k \le \bar\lambda_k \le C\lambda_k$. In this case, (3.1) holds with $C = \bar\lambda_k/\lambda_k$, (C.2)
holds with $\theta = 1$ and (3.3) holds with $C = \lambda_k/\bar\lambda_k$. To avoid the inversion of $L^2$ Gram matrices in the multigrid
algorithm, we use the inner product

$$(u,v)_k=h_k^2\sum_i u(x_i)v(x_i). \tag{3.4}$$

Here the sum is taken over all nodes $x_i$ of the subspace $M_k$. Note that $(\cdot,\cdot)_k$ is uniformly (independent of $k$)
equivalent to $(\cdot,\cdot)$ on $M_k$.
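The sketch below illustrates the uniform equivalence of $(\cdot,\cdot)_k$ and $(\cdot,\cdot)$ with a 1-D analogue. This is an assumption: in one dimension the weight $h_k^2$ in (3.4) becomes $h_k$. Over all piecewise linear $u$, the ratio $(u,u)/(u,u)_k$ ranges over the eigenvalues of the mass matrix divided by $h$. These stay inside $(1/3, 1)$ for every mesh size.

```python
import numpy as np

def mass_matrix_1d(n):
    """L^2 Gram matrix of the n interior hat functions on a uniform mesh of (0, 1)."""
    h = 1.0 / (n + 1)
    return h / 6.0 * (4.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1))

bounds = {}
for n in (7, 31, 127):
    h = 1.0 / (n + 1)
    # (u, u) / (u, u)_k with (u, u)_k = h * sum_i u(x_i)^2: eigenvalues of M / h
    ratios = np.linalg.eigvalsh(mass_matrix_1d(n)) / h
    bounds[n] = (ratios.min(), ratios.max())
    print(n, bounds[n])
```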

The remaining smoothers correspond to Jacobi and Gauss-Seidel, point and line iteration methods. We shall
present these smoothers in terms of subspace decompositions. Specifically, we write

$$M_k=\sum_{i=1}^{l}M_k^i, \tag{3.5}$$

where $M_k^i$ is the one dimensional subspace spanned by the nodal basis function $\phi_i$ or the subspace spanned by
the nodal basis functions along a line. The number of such spaces $l = l(k)$ will often depend on $k$. These spaces
satisfy the following inequality:

$$\|v\|\le Ch_k\|v\|_1\quad\text{for all } v\in M_k^i. \tag{3.6}$$


Example 2. For the second example, we consider the additive smoother defined by

$$\bar R_k=\gamma\sum_{i=1}^{l}\bar A_{k,i}^{-1}Q_{k,i}. \tag{3.7}$$

Here $\bar A_{k,i} : M_k^i \to M_k^i$ is defined by

$$(\bar A_{k,i}v,\chi)_k=\bar A(v,\chi)\quad\text{for all } \chi\in M_k^i,$$

and $Q_{k,i} : M_k \to M_k^i$ is the projection onto $M_k^i$ with respect to the inner product $(\cdot,\cdot)_k$. The constant $\gamma$ is a scaling
factor which is chosen to ensure that (C.2) is satisfied (see, e.g., (ref. 11), (ref. 5)). Note that $\bar R_k$ is symmetric
with respect to the inner product $(\cdot,\cdot)_k$. In addition, (3.1) and (3.3) are shown to hold in (ref. 11) for point
Jacobi. When the subspaces $M_k^i$ are defined in terms of lines, (3.1) was proved in (ref. 5). The estimate (3.3)
easily follows in the line case using the support properties of the basis functions and (3.6). For this example, we
take $(\cdot,\cdot)_k = (\cdot,\cdot)$ for all $k$.
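In matrix terms, this sketch uses the Euclidean inner product on nodal coefficients. That is an assumption; Example 2 itself takes $(\cdot,\cdot)_k = (\cdot,\cdot)$. With this choice, $Q_{k,i}$ is restriction to the indices of the $i$-th block and $\bar A_{k,i}$ is the corresponding diagonal block, so (3.7) is a scaled point or line Jacobi method. The hypothetical model below is an unscaled 5-point Laplacian with $\gamma = 1$. The sketch also estimates the constant $\theta$ of (C.2) as the largest eigenvalue of $\bar R_k \bar A_k$.

```python
import numpy as np

def additive_smoother(A, blocks, gamma):
    """Dense form of R = gamma * sum_i A_i^{-1} Q_i: Q_i restricts to the indices
    of block i and A_i = A[block i, block i] (Euclidean inner product assumed)."""
    R = np.zeros_like(A)
    for idx in blocks:
        sub = np.ix_(idx, idx)
        R[sub] += np.linalg.inv(A[sub])
    return gamma * R

def laplacian_2d(m):
    """Unscaled 5-point negative Laplacian on an m x m grid, x index fastest."""
    T = 2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
    return np.kron(np.eye(m), T) + np.kron(T, np.eye(m))

m = 8
A = laplacian_2d(m)
points = [np.array([i]) for i in range(m * m)]                  # point Jacobi
lines = [np.arange(j * m, (j + 1) * m) for j in range(m)]       # x-line Jacobi
R_point = additive_smoother(A, points, gamma=1.0)
R_line = additive_smoother(A, lines, gamma=1.0)
# For symmetric R, theta in (C.2) is the largest eigenvalue of R A.
theta_point = np.linalg.eigvals(R_point @ A).real.max()
theta_line = np.linalg.eigvals(R_line @ A).real.max()
print(theta_point, theta_line)
```

For this model both values of $\theta$ stay below 2 with $\gamma = 1$. Other operators may need a smaller $\gamma$.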


Example 3. We next consider the multiplicative smoother. Given $f \in M_k$, we define $\bar R_k$ by

(1) Set $v_0 = 0 \in M_k$.

(2) Define $v_i$, for $i = 1,\ldots,l$, by

$$v_i=v_{i-1}+\bar A_{k,i}^{-1}Q_{k,i}(f-\bar A_kv_{i-1}).$$

(3) Set $\bar R_kf = v_l$.

Conditions (C.1) and (C.2) are known for this operator (see, e.g., (ref. 5)). The next lemma shows that (C.3)
holds for this choice of $\bar R_k$. For this case, we also take $(\cdot,\cdot)_k = (\cdot,\cdot)$ for all $k$.
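This sketch uses the same matrix interpretation as before: Euclidean coefficients, with index blocks standing in for the subspaces. That is an assumption of the sketch. Under it, Example 3 is a block Gauss-Seidel sweep. The check below uses point blocks on a small symmetric model matrix. There, one sweep from zero reduces to forward substitution with the lower triangle of the matrix.

```python
import numpy as np

def multiplicative_smoother(A, blocks, f):
    """R f from Example 3: visit the blocks in order and correct v by a local
    solve on each one (block Gauss-Seidel when blocks are nodes or lines)."""
    v = np.zeros_like(f)
    for idx in blocks:
        r = f - A @ v                                   # current residual
        v[idx] += np.linalg.solve(A[np.ix_(idx, idx)], r[idx])
    return v

rng = np.random.default_rng(0)
n = 12
A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)  # hypothetical model matrix
f = rng.standard_normal(n)
v = multiplicative_smoother(A, [np.array([i]) for i in range(n)], f)
print(np.allclose(v, np.linalg.solve(np.tril(A), f)))   # True
```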

Lemma 3.1. (C.3) holds when $\bar R_k$ is defined to be the multiplicative smoother of Example 3.

Proof. The proof uses the techniques for analyzing smoothers presented in (ref. 5). Fix $k > 1$ and let

$$S_i=(I-\bar P_k^i)(I-\bar P_k^{i-1})\cdots(I-\bar P_k^1), \tag{3.8}$$

where $\bar P_k^i$ denotes the $\bar A(\cdot,\cdot)$ projection onto $M_k^i$ and $S_0 = I$. Note that $(I - \bar T_k) = S_l$ and $S_{i-1} - S_i = \bar P_k^iS_{i-1}$. Hence

$$\bar T_k=I-S_l=\sum_{i=1}^{l}\bar P_k^iS_{i-1}$$

and for every $u \in M_k$ (cf. (ref. 5)),

$$\bar A((2I-\bar T_k)u,\bar T_ku)=\bar A(u,u)-\bar A(S_lu,S_lu)=\sum_{i=1}^{l}\bar A(\bar P_k^iS_{i-1}u,S_{i-1}u).$$

Since $\bar A((2I-\bar T_k)u,\bar T_ku)\le 2\bar A(\bar T_ku,u)$ and $h_k^2 \le c\lambda_k^{-1}$, the proof of the lemma will be complete if we can show that

$$(\bar T_ku,\bar T_ku)\le ch_k^2\sum_{i=1}^{l}\bar A(\bar P_k^iS_{i-1}u,S_{i-1}u). \tag{3.9}$$

Expanding the left hand side of (3.9) gives

$$(\bar T_ku,\bar T_ku)=\sum_{i=1}^{l}\sum_{j=1}^{l}(\bar P_k^iS_{i-1}u,\bar P_k^jS_{j-1}u). \tag{3.10}$$

Because of their support properties, the subspaces $\{M_k^i\}$ satisfy a limited interaction property in that for
every $i$, the number of subspaces $j$ for which $(v^i, v^j) \ne 0$, with $v^i \in M_k^i$ and $v^j \in M_k^j$, is bounded by a fixed
constant $n_0$ not depending on $k$ or $l$. Lemma 3.1 of (ref. 5) implies that the double sum of (3.10) can be bounded
by $n_0$ times its diagonal, i.e.,

$$(\bar T_ku,\bar T_ku)\le n_0\sum_{i=1}^{l}(\bar P_k^iS_{i-1}u,\bar P_k^iS_{i-1}u). \tag{3.11}$$

Applying (3.6) gives

$$(\bar P_k^iS_{i-1}u,\bar P_k^iS_{i-1}u)\le Ch_k^2\,\bar A(\bar P_k^iS_{i-1}u,S_{i-1}u). \tag{3.12}$$

Combining (3.11) and (3.12) proves (3.9). This completes the proof of the lemma.

Remark 3.2. The same analysis could be used for successive overrelaxation type iteration. In that case,

$$S_l=(I-\beta\bar P_k^l)(I-\beta\bar P_k^{l-1})\cdots(I-\beta\bar P_k^1),$$

where $\beta \in (0,2)$ is the relaxation parameter.

4. SMOOTHERS BASED ON $A_k$.


In this section, we consider smoothing operators $R_k$ which are defined directly in terms of the nonsymmetric
and/or indefinite operator $A_k$. The first smoother is one that was originally analyzed in (ref. 1) and subsequently
studied in (ref. 10).


Example 4. For our first example of a smoother based on $A_k$, we consider $R_k$ defined by

$$R_k=\lambda_k^{-2}A_k^t.$$

Here, $A_k^t$ is the adjoint of $A_k$ with respect to the inner product $(\cdot,\cdot)_k$ and $\lambda_k$ is as in Example 1. A possible
motivation for such a choice is that, on $M_k$, the iteration

$$v^i=v^{i-1}+\lambda_k^{-2}A_k^t(f-A_kv^{i-1})$$

is stable in the norm $(\cdot,\cdot)_k^{1/2}$ provided that $\lambda_k^2$ is greater than or equal to half the largest eigenvalue of $A_k^tA_k$.
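The stability statement can be checked on a small example. The sketch uses a hypothetical nonsymmetric matrix $A$ (not taken from the paper) and the Euclidean inner product. The error propagator of $v \leftarrow v + \lambda^{-2}A^T(f - Av)$ is then the symmetric matrix $I - \lambda^{-2}A^TA$. Its eigenvalues $1 - \mu/\lambda^2$ lie in $[-1, 1]$ exactly when $\lambda^2$ is at least half the largest eigenvalue $\mu_{\max}$ of $A^TA$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
# Hypothetical nonsymmetric test matrix (not from the paper).
A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1) + 0.1 * rng.standard_normal((n, n))
mu_max = np.linalg.eigvalsh(A.T @ A).max()

def propagator_norm(A, lam_sq):
    """Euclidean norm of G = I - A^T A / lam_sq, the error propagator of
    v <- v + A^T (f - A v) / lam_sq; G is symmetric, so the norm is max |eig|."""
    G = np.eye(len(A)) - (A.T @ A) / lam_sq
    return np.abs(np.linalg.eigvalsh(G)).max()

norms = {frac: propagator_norm(A, frac * mu_max) for frac in (1.0, 0.5, 0.4)}
print(norms)
```

With $\lambda^2 = \mu_{\max}/2$ the norm is exactly 1 (up to rounding), the borderline case. With $\lambda^2 = 0.4\,\mu_{\max}$ the eigenvalue at $\mu_{\max}$ is $1 - 2.5 = -1.5$, so the iteration is unstable.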


Example 5. This example is closely related to the second example of the previous section. As in that example, we
define the line or point subspaces $\{M_k^i\}$ for $i = 1,\ldots,l$. Note that the form $A(\cdot,\cdot)$ satisfies a Gårding inequality

$$c_1\bar A(u,u)-c\|u\|^2\le A(u,u)\quad\text{for all } u\in H^1_0(\Omega).$$

Consequently, by (3.6),

$$(c_1-Ch_k^2)\,\bar A(u,u)\le A(u,u)\quad\text{for all } u\in M_k^i.$$

We will assume that $h_1$ is sufficiently small so that

$$Ch_1^2<c_1/2. \tag{4.1}$$

This means that $A(\cdot,\cdot)$ restricted to $M_k^i$ has a positive definite symmetric part. Hence, the projector $P_k^i : M_k \to M_k^i$ satisfying

$$A(P_k^iv,w)=A(v,w)\quad\text{for all } w\in M_k^i$$

is well defined and satisfies

$$\|P_k^iv\|_1\le C\|v\|_{1,\Omega_k^i}. \tag{4.2}$$

The second norm is taken only over the subdomain $\Omega_k^i$, which is the set of points of $\Omega$ where the functions in $M_k^i$
are nonzero. In addition, the operator $A_{k,i} : M_k^i \to M_k^i$ defined by

$$(A_{k,i}v,w)_k=A(v,w)\quad\text{for all } v,w\in M_k^i$$

is invertible. We set

$$R_k=\gamma\sum_{i=1}^{l}A_{k,i}^{-1}Q_{k,i}.$$

We choose $\gamma$ as in Example 2 so that the symmetric smoother defined by (3.7) satisfies (C.2).
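The sketch below uses a hypothetical indefinite model: $-\Delta u + b u_x - c u$ with centred differences, where $c$ exceeds the smallest Laplacian eigenvalue. It illustrates why the local operators $A_{k,i}$ of Example 5 can be inverted even though the global form is indefinite. On a fine enough mesh the symmetric part of each line block is positive definite, in the spirit of (4.1). The smoother is then assembled from nonsymmetric line solves with $\gamma = 1$, which is also an assumption.

```python
import numpy as np

def conv_diff_2d(m, b, c):
    """Hypothetical model operator: -Lap u + b u_x - c u on the unit square,
    centred differences, m x m interior nodes, x index fastest."""
    h = 1.0 / (m + 1)
    Im = np.eye(m)
    T = (2.0 * Im - np.eye(m, k=1) - np.eye(m, k=-1)) / h**2
    Cx = (np.eye(m, k=1) - np.eye(m, k=-1)) / (2.0 * h)
    return np.kron(Im, T) + np.kron(T, Im) + b * np.kron(Im, Cx) - c * np.eye(m * m)

def sym(B):
    """Symmetric part of a matrix."""
    return 0.5 * (B + B.T)

m, gamma = 15, 1.0
A = conv_diff_2d(m, b=10.0, c=50.0)
lines = [np.arange(j * m, (j + 1) * m) for j in range(m)]

global_min = np.linalg.eigvalsh(sym(A)).min()                  # negative: indefinite
local_min = min(np.linalg.eigvalsh(sym(A[np.ix_(l, l)])).min() for l in lines)

# Example 5 smoother: gamma * sum_i A_{k,i}^{-1} Q_{k,i} with nonsymmetric line solves.
R = np.zeros_like(A)
for l in lines:
    R[np.ix_(l, l)] = gamma * np.linalg.inv(A[np.ix_(l, l)])
print(global_min, local_min)
```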


Example 6. Our final example is that of Gauss-Seidel directly applied to the nonsymmetric/indefinite equations.
We assume that the subspaces $\{M_k^i\}$ satisfy the conditions of the previous example. The block Gauss-Seidel
algorithm (based on $A_k$) is given as follows:

(1) Set $v_0 = 0 \in M_k$.

(2) Define $v_i$, for $i = 1,\ldots,l$, by

$$v_i=v_{i-1}+A_{k,i}^{-1}Q_{k,i}(f-A_kv_{i-1}).$$

(3) Set $R_kf = v_l$.
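The Example 6 sweep can be written as below. The model is a hypothetical indefinite convection-diffusion matrix with line blocks, using Euclidean coefficients in place of $(\cdot,\cdot)_k$ (an assumption). One forward sweep from zero coincides with solving against the block lower triangular part of the matrix, which gives a convenient correctness check.

```python
import numpy as np

def conv_diff_2d(m, b, c):
    """Hypothetical model operator: -Lap u + b u_x - c u, centred differences,
    m x m interior nodes of the unit square, x index fastest."""
    h = 1.0 / (m + 1)
    Im = np.eye(m)
    T = (2.0 * Im - np.eye(m, k=1) - np.eye(m, k=-1)) / h**2
    Cx = (np.eye(m, k=1) - np.eye(m, k=-1)) / (2.0 * h)
    return np.kron(Im, T) + np.kron(T, Im) + b * np.kron(Im, Cx) - c * np.eye(m * m)

def block_gauss_seidel(A, blocks, f):
    """R_k f of Example 6: sequential local solves with the nonsymmetric blocks A_{k,i}."""
    v = np.zeros_like(f)
    for idx in blocks:
        r = f - A @ v
        v[idx] += np.linalg.solve(A[np.ix_(idx, idx)], r[idx])
    return v

m = 10
A = conv_diff_2d(m, b=10.0, c=50.0)
lines = [np.arange(j * m, (j + 1) * m) for j in range(m)]
f = np.random.default_rng(0).standard_normal(m * m)
v = block_gauss_seidel(A, lines, f)

block_of = np.repeat(np.arange(m), m)                  # line index of each unknown
L_block = np.where(block_of[:, None] >= block_of[None, :], A, 0.0)
print(np.allclose(v, np.linalg.solve(L_block, f)))     # True
```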


5. ANALYSIS OF THE MULTIGRID ITERATION (2.12).


We provide an analysis of the multigrid iteration (2.12) in this section. This analysis is based on the product
representation of the error operator (2.13). All of the analysis of this section is based on perturbation from the
uniform convergence estimates for multigrid applied to symmetric problems.

We start by stating a result from (ref. 6) estimating the rate of convergence for the multigrid algorithm applied
to the symmetric problem. Specifically, we replace $A_k$ by $\bar A_k$ and $R_k$ by $\bar R_k$ in Definition MG.
From the earlier discussion, the error operator associated with this iteration applied to finding the solution of the
symmetric problem

$$\bar A_Ju=Q_Jf$$

is given by $\bar E = \bar E_J$, where (with $\bar T_1 = \bar P_1$)

$$\bar E_k=(I-\bar T_1)(I-\bar T_2)\cdots(I-\bar T_k). \tag{5.1}$$

We then have the following theorem.

Theorem 5.1 (ref. 6). For $k > 1$, let $\bar R_k$ satisfy (C.1) and (C.2). Under the assumptions on the domain $\Omega$ and
the coefficients of (2.1) given in Section 2, there exists a positive constant $\bar\delta < 1$ not depending on $J$ such that

$$\bar A(\bar E_Ju,\bar E_Ju)\le\bar\delta^2\,\bar A(u,u)\quad\text{for all } u\in M_J.$$


To analyze the multigrid algorithms using the smoothers of Section 3, we use the perturbation operator

$$Z_k=T_k-\bar T_k.$$

We note that for any $u, v \in M_J$ and $k > 1$,

$$\bar A(Z_ku,v)=D(u,\bar T_k^*v). \tag{5.2}$$

Indeed, by definition,

$$\bar A(T_ku,v)=(T_ku,\bar A_k\bar P_kv)_k=(A_kP_ku,\bar R_k^t\bar A_k\bar P_kv)_k=(A_kP_ku,\bar T_k^*v)_k=A(P_ku,\bar T_k^*v)=A(u,\bar T_k^*v).$$

Since $\bar A(\bar T_ku,v)=\bar A(u,\bar T_k^*v)$, the equality (5.2) immediately follows.

To handle the case of $k = 1$, we have

$$\bar A(Z_1u,v)=D((I-P_1)u,\bar P_1v). \tag{5.3}$$

In fact, by definition,

$$\begin{aligned}\bar A(P_1u,v)&=\bar A(P_1u,\bar P_1v)\\&=A(u,\bar P_1v)-D(P_1u,\bar P_1v)\\&=\bar A(\bar P_1u,v)+D((I-P_1)u,\bar P_1v).\end{aligned}$$

The following theorem provides an estimate for the multigrid algorithm when the smoothers of Section 3 are used.

Theorem 5.2. Let $R_k = \bar R_k$ and assume that (C.1)-(C.3) hold. Given $\epsilon > 0$, there exists an $h_0 > 0$ such that for
$h_1 < h_0$,

$$\bar A(Eu,Eu)\le\delta^2\,\bar A(u,u)\quad\text{for all } u\in M_J,$$

for $\delta = \bar\delta + c(h_1 + \epsilon)$. Here $\bar\delta$ is less than one (independently of $J$) and is given by Theorem 5.1.


Proof. For an arbitrary operator $O : M_J \to M_J$, let $\|O\|_A$ denote its operator norm, i.e.,

$$\|O\|_A=\sup_{u,v\in M_J}\frac{\bar A(Ou,v)}{\bar A(u,u)^{1/2}\,\bar A(v,v)^{1/2}}.$$

Applying (2.4), (2.9) and (2.8) to (5.3) gives

$$|\bar A(Z_1u,v)|\le C\epsilon\,\|(I-P_1)u\|_1\|v\|_1\le C\epsilon\,\|u\|_1\|v\|_1.$$

This means that the operator norm of $Z_1$ is bounded by $C\epsilon$. Since the operator norm of $(I - \bar P_1)$ is less than or
equal to one, the triangle inequality implies that the operator norm of $(I - P_1) = (I - \bar P_1 - Z_1)$ is bounded by
$1 + C\epsilon$.

For $k > 1$, applying (2.4), (C.3), Remark 3.1, and (3.2) to (5.2) gives

$$|\bar A(Z_ku,v)|\le ch_k\|u\|_1\,\bar A(\bar T_k^*v,v)^{1/2}\le ch_k\|u\|_1\|v\|_1,$$

i.e., the operator norm of $Z_k$ is bounded by $ch_k$. Since, by (3.2), the operator norm of $(I - \bar T_k)$ is less than or equal
to one, the triangle inequality implies that the operator norm of $(I - T_k) = (I - \bar T_k - Z_k)$ is less than or equal to
$1 + ch_k$. Hence, it follows that

$$\|E_k\|_A\le(1+C\epsilon)\prod_{i=2}^{k}(1+ch_i)\le C.$$

It is immediate from the definitions that

$$E_k-\bar E_k=(E_{k-1}-\bar E_{k-1})(I-\bar T_k)-E_{k-1}Z_k. \tag{5.4}$$

By (3.2) and the above estimates, for $k > 1$,

$$\|E_k-\bar E_k\|_A\le\|E_{k-1}-\bar E_{k-1}\|_A\,\|I-\bar T_k\|_A+\|E_{k-1}\|_A\,\|Z_k\|_A\le\|E_{k-1}-\bar E_{k-1}\|_A+ch_k. \tag{5.5}$$
Repetitively applying (5.5) and using

$$\|E_1-\bar E_1\|_A=\|Z_1\|_A\le C\epsilon$$

gives that

$$\|E_J-\bar E_J\|_A\le C\epsilon+C\sum_{k=2}^{J}h_k\le c(h_1+\epsilon),$$

where we used $\sum_{k=2}^{J}h_k=h_1\sum_{k=2}^{J}2^{1-k}<h_1$. The theorem follows from the triangle inequality and Theorem 5.1.
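The $J$-independence in this proof relies only on the geometric decay $h_k = 2^{1-k}h_1$. The sum $\sum_{k=2}^J h_k$ stays below $h_1$. Because $1 + x \le e^x$, the product $\prod_{k=2}^J (1 + ch_k)$ stays below $e^{ch_1}$. A quick numerical check, with arbitrary values of $h_1$ and $c$:

```python
import math

def level_bounds(h1, c, J):
    """Sum of h_k and product of (1 + c h_k) over k = 2..J, with h_k = 2**(1-k) h1."""
    hs = [2.0 ** (1 - k) * h1 for k in range(2, J + 1)]
    return sum(hs), math.prod(1.0 + c * h for h in hs)

h1, c = 0.1, 5.0
for J in (2, 5, 10, 20, 40):
    s, p = level_bounds(h1, c, J)
    print(J, s, p, math.exp(c * h1))
```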

Remark 5.1. Note that $\epsilon$ can be made arbitrarily small by taking $h_1$ small enough. Consequently, Theorem 5.2
shows that the multigrid iteration converges with a rate which is independent of $J$ provided that the coarse grid is
fine enough. The coarse grid mesh size can also be taken to be independent of $J$.
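The sketch below is a rough numerical illustration of Remark 5.1, not an experiment from the paper. It forms the error operator (2.13) for a hypothetical 1-D convection-diffusion model. The coarsest grid is fixed, coarse operators are Galerkin, and the smoother is Richardson with $\lambda_k$ taken from the symmetric part of $A_k$. The script reports the spectral radius of the error operator as the number of levels $J$ grows. The theory predicts a bound below one that does not deteriorate with $J$; the script only prints the computed values.

```python
import numpy as np

def interpolation(nc):
    """Linear interpolation from nc to 2*nc + 1 interior nodes (1-D)."""
    P = np.zeros((2 * nc + 1, nc))
    for j in range(nc):
        P[2 * j:2 * j + 3, j] = [0.5, 1.0, 0.5]
    return P

def model_operator(n, b):
    """Hypothetical model: centred differences for -u'' + b u' on (0, 1)."""
    h = 1.0 / (n + 1)
    off = np.ones(n - 1)
    A = (2.0 * np.eye(n) - np.diag(off, 1) - np.diag(off, -1)) / h**2
    return A + b / (2.0 * h) * (np.diag(off, 1) - np.diag(off, -1))

def vcycle_error_operator(n1, J, b):
    """E = (I - T_1)...(I - T_J) with Galerkin coarse operators and R_k = I / lambda_k."""
    n = [(n1 + 1) * 2**k - 1 for k in range(J)]
    AJ = model_operator(n[-1], b)
    Ihat = {J: np.eye(n[-1])}                         # interpolation from level k to J
    for k in range(J, 1, -1):
        Ihat[k - 1] = Ihat[k] @ interpolation(n[k - 2])
    E = np.eye(n[-1])
    for k in range(1, J + 1):
        Ak = Ihat[k].T @ AJ @ Ihat[k]
        if k == 1:
            Tk = Ihat[1] @ np.linalg.solve(Ak, Ihat[1].T @ AJ)   # T_1 = P_1
        else:
            lam = np.linalg.eigvalsh(0.5 * (Ak + Ak.T)).max()
            Tk = Ihat[k] @ (Ihat[k].T @ AJ) / lam                # R_k A_k P_k
        E = E @ (np.eye(n[-1]) - Tk)
    return E

rates = {J: np.abs(np.linalg.eigvals(vcycle_error_operator(3, J, b=1.0))).max()
         for J in range(2, 7)}
print(rates)
```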

We next consider the case of Example 4. For this example, we consider first the multigrid algorithm for the
symmetric problem which uses

$$\bar R_k=\lambda_k^{-2}\bar A_k \tag{5.6}$$

as a smoother. From the discussion in Section 2, the iteration (2.12) with $\bar R_k$ (given by (5.6)) and $\bar A_k$ replacing,
respectively, $R_k$ and $A_k$ in Definition MG, gives rise to the error operator given by (5.1) where, as above, for
$k > 1$, $\bar T_k = \bar R_k\bar A_k\bar P_k$. The smoother (5.6) does not satisfy (C.1) and so the first step in the analysis of the
nonsymmetric and/or indefinite example is to provide a uniform estimate for $\bar E_J$ given by (5.1). Such an estimate
is provided in the following theorem. Its proof is given in the appendix.

Theorem 5.3. Let $\bar E_J$ be given by (5.1) where $\bar T_k = \bar R_k\bar A_k\bar P_k$ and $\bar R_k$ is defined by (5.6). Then,

$$\bar A(\bar E_Ju,\bar E_Ju)\le\bar\delta^2\,\bar A(u,u)\quad\text{for all } u\in M_J.$$

Here $\bar\delta$ is less than one and independent of $J$.

We can now prove the convergence estimate for multigrid applied to (2.1) using the smoother of Example 4.

Theorem 5.4. Let $R_k$ be defined by Example 4. Given $\epsilon > 0$, there exists an $h_0 > 0$ such that for $h_1 < h_0$,

$$\bar A(Eu,Eu)\le\delta^2\,\bar A(u,u)\quad\text{for all } u\in M_J,$$

for $\delta = \bar\delta + c(h_1 + \epsilon)$. Here $\bar\delta$ is less than one (independently of $J$) and is given by Theorem 5.3.

Proof. For $k > 1$, we consider the perturbation operator

$$Z_k=T_k-\bar T_k=\lambda_k^{-2}\big(A_k^tA_kP_k-\bar A_k^2\bar P_k\big).$$

Clearly,

$$Z_k=\lambda_k^{-2}\big(A_k^t(A_kP_k-\bar A_k\bar P_k)+(A_k^t-\bar A_k)\bar A_k\bar P_k\big). \tag{5.7}$$

As in (5.2),

$$\lambda_k^{-1}\bar A\big((A_kP_k-\bar A_k\bar P_k)u,v\big)=\lambda_k^{-1}D(u,\bar A_k\bar P_kv),$$

from which it follows using (2.4) that

$$\|\lambda_k^{-1}(A_kP_k-\bar A_k\bar P_k)\|_A\le ch_k.$$

A similar argument shows that

$$\|\lambda_k^{-1}(A_k^t-\bar A_k)\bar P_k\|_A\le ch_k.$$

It is not difficult to show that

$$\|A_k^t\|_A\le c\lambda_k.$$

Combining the above estimates with (5.7) gives

$$\|Z_k\|_A\le\|\lambda_k^{-1}A_k^t\|_A\,\|\lambda_k^{-1}(A_kP_k-\bar A_k\bar P_k)\|_A+\|\lambda_k^{-1}(A_k^t-\bar A_k)\bar P_k\|_A\,\|\lambda_k^{-1}\bar A_k\bar P_k\|_A\le ch_k.$$

The remainder of the proof is exactly the same as that of Theorem 5.2. This completes the proof of the theorem.

We next consider the case of Example 5. We use perturbation from the multigrid algorithm for $\bar A$ which uses
the smoother $\bar R_k$ defined by Example 2. Theorem 5.1 provides a uniform estimate for the operator norm of $\bar E_J$.

Theorem 5.5. Let $R_k$ be defined by Example 5. Given $\epsilon > 0$, there exists an $h_0 > 0$ such that for $h_1 < h_0$,

$$\bar A(Eu,Eu)\le\delta^2\,\bar A(u,u)\quad\text{for all } u\in M_J,$$

for $\delta = \bar\delta + c(h_1 + \epsilon)$. Here $\bar\delta$ is less than one (independently of $J$) and is given by Theorem 5.1 applied to $\bar R_k$
defined in Example 2.


Proof. For this case, the perturbation operator $Z_k$ is given by

$$Z_k=\gamma\sum_{i=1}^{l}(P_k^i-\bar P_k^i).$$

As in (5.3),

$$\bar A\big((P_k^i-\bar P_k^i)u,v\big)=D\big((I-P_k^i)u,\bar P_k^iv\big).$$

Applying (2.4), (3.6) and (4.2) gives

$$\bar A\big((P_k^i-\bar P_k^i)u,v\big)\le ch_k\|u\|_{1,\Omega_k^i}\|v\|_{1,\Omega_k^i} \tag{5.8}$$

and hence

$$\bar A(Z_ku,v)\le ch_k\sum_{i=1}^{l}\|u\|_{1,\Omega_k^i}\|v\|_{1,\Omega_k^i}.$$

Using the limited overlap properties of the domains $\Omega_k^i$ gives

$$\|Z_k\|_A\le ch_k.$$

The remainder of the proof of the theorem is exactly the same as that given in the proof of Theorem 5.2.

We finally consider the case of Example 6. We use perturbation from the multigrid algorithm for $\bar A$ which uses
the smoother $\bar R_k$ defined by Example 3. Theorem 5.1 provides a uniform estimate for the operator norm of $\bar E_J$.

Theorem 5.6. Let $R_k$ be defined by Example 6. Given $\epsilon > 0$, there exists an $h_0 > 0$ such that for $h_1 < h_0$,

$$\bar A(Eu,Eu)\le\delta^2\,\bar A(u,u)\quad\text{for all } u\in M_J,$$

for $\delta = \bar\delta + c(h_1 + \epsilon)$. Here $\bar\delta$ is less than one (independently of $J$) and is given by Theorem 5.1 applied with $\bar R_k$
defined as in Example 3.


Proof. The perturbation operator for this example is

$$Z_k = T_k - \tilde T_k = \tilde E_l - E_l,$$

where $E_l$ is given by (3.8) and

$$\tilde E_i = (I - \tilde P_k^i)(I - \tilde P_k^{i-1})\cdots(I - \tilde P_k^1),$$

with $\tilde E_0 = I$. As in (5.4),

$$\tilde E_i - E_i = (I - P_k^i)(\tilde E_{i-1} - E_{i-1}) + (P_k^i - \tilde P_k^i)\tilde E_{i-1}.$$

Since the two terms on the right are orthogonal with respect to $A(\cdot,\cdot)$, we have that

$$\|(\tilde E_i - E_i)u\|_A^2 = \|(I - P_k^i)(\tilde E_{i-1} - E_{i-1})u\|_A^2 + \|(P_k^i - \tilde P_k^i)\tilde E_{i-1}u\|_A^2.$$

Because of (5.8) and the fact that the operator norm of $(I - P_k^i)$ is bounded by one, it follows that

$$\|(\tilde E_i - E_i)u\|_A^2 \le \|(\tilde E_{i-1} - E_{i-1})u\|_A^2 + Ch_k^2\|\tilde E_{i-1}u\|_{1,\Omega_k^i}^2.$$

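Summing over $i$, since $\tilde E_0 = E_0 = I$, we obtain

$$\|(\tilde E_l - E_l)u\|_A^2 \le Ch_k^2\sum_{i=1}^{l}\|\tilde E_{i-1}u\|_{1,\Omega_k^i}^2. \tag{5.9}$$

We shall show that

$$\sum_{i=1}^{l}\|\tilde E_{i-1}u\|_{1,\Omega_k^i}^2 \le C\|u\|_A^2. \tag{5.10}$$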


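By the arithmetic-geometric mean inequality, the definition of $\tilde E_i$, and the limited interaction property (see (3.10) and above), it follows that

$$\begin{aligned}
\sum_{i=1}^{l}\|\tilde E_{i-1}u\|_{1,\Omega_k^i}^2
&= \sum_{i=1}^{l}\Big\|u - \sum_{m=1}^{i-1}\tilde P_k^m\tilde E_{m-1}u\Big\|_{1,\Omega_k^i}^2 \\
&\le C\|u\|_A^2 + 2\sum_{i=1}^{l}\Big\|\sum_{m=1}^{i-1}\tilde P_k^m\tilde E_{m-1}u\Big\|_{1,\Omega_k^i}^2 \\
&\le C\Big(\|u\|_A^2 + \sum_{i=1}^{l}\sum_{m=1}^{i-1}\|\tilde P_k^m\tilde E_{m-1}u\|_{1,\Omega_k^i}^2\Big) \\
&\le C\Big(\|u\|_A^2 + \sum_{m=1}^{l}\|\tilde P_k^m\tilde E_{m-1}u\|_A^2\Big).
\end{aligned} \tag{5.11}$$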

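In order to estimate the last term on the right of (5.11), we write

$$\begin{aligned}
\|\tilde P_k^m\tilde E_{m-1}u\|_A^2 &= A(\tilde P_k^m\tilde E_{m-1}u,\ \tilde P_k^m\tilde E_{m-1}u) \\
&= A((\tilde E_{m-1} - \tilde E_m)u,\ (\tilde E_{m-1} - \tilde E_m)u) \\
&= A((\tilde E_{m-1} - \tilde E_m)u,\ (\tilde E_{m-1} + \tilde E_m)u) - 2A(\tilde P_k^m\tilde E_{m-1}u,\ \tilde E_m u) \\
&= A(\tilde E_{m-1}u,\ \tilde E_{m-1}u) - A(\tilde E_m u,\ \tilde E_m u) - 2A(\tilde P_k^m\tilde E_{m-1}u,\ (I - \tilde P_k^m)\tilde E_{m-1}u).
\end{aligned} \tag{5.12}$$

Now, by (5.8),

$$\begin{aligned}
A(\tilde P_k^m\tilde E_{m-1}u,\ (I - \tilde P_k^m)\tilde E_{m-1}u) &= A(\tilde P_k^m\tilde E_{m-1}u,\ (P_k^m - \tilde P_k^m)\tilde E_{m-1}u) \\
&\le Ch_k\|\tilde P_k^m\tilde E_{m-1}u\|_A\,\|\tilde E_{m-1}u\|_{1,\Omega_k^m}.
\end{aligned} \tag{5.13}$$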
Hence, combining (5.12) and (5.13), we have

$$\|\tilde P_k^m\tilde E_{m-1}u\|_A^2 \le C\big[A(\tilde E_{m-1}u,\tilde E_{m-1}u) - A(\tilde E_m u,\tilde E_m u)\big] + Ch_k^2\|\tilde E_{m-1}u\|_{1,\Omega_k^m}^2.$$
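The step from (5.12) and (5.13) to the display above uses the arithmetic-geometric mean inequality to absorb the cross term. In the sketch below, $a_m = \|\tilde P_k^m\tilde E_{m-1}u\|_A$ and $b_m = \|\tilde E_{m-1}u\|_{1,\Omega_k^m}$ are shorthand introduced here, not notation from the paper.

```latex
\begin{aligned}
a_m^2 &\le \big[\|\tilde E_{m-1}u\|_A^2 - \|\tilde E_m u\|_A^2\big] + 2C h_k\, a_m b_m \\
      &\le \big[\|\tilde E_{m-1}u\|_A^2 - \|\tilde E_m u\|_A^2\big] + \tfrac12 a_m^2 + 2C^2 h_k^2\, b_m^2, \\
\text{hence}\quad
a_m^2 &\le 2\big[\|\tilde E_{m-1}u\|_A^2 - \|\tilde E_m u\|_A^2\big] + 4C^2 h_k^2\, b_m^2 .
\end{aligned}
```

When this is summed over $m$, the bracketed terms telescope to $\|u\|_A^2 - \|\tilde E_l u\|_A^2 \le \|u\|_A^2$, because $\tilde E_0 = I$.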


Summing over $m$, we conclude that

$$\sum_{m=1}^{l}\|\tilde P_k^m\tilde E_{m-1}u\|_A^2 \le C\|u\|_A^2 + Ch_k^2\sum_{m=1}^{l}\|\tilde E_{m-1}u\|_{1,\Omega_k^m}^2.$$

This, together with (5.11), yields (5.10) when $h_k$ is small enough. Finally, we obtain from (5.10) and (5.9) that, for $k > 1$,

$$\|Z_k\|_A \le Ch_k.$$

The remainder of the proof of this theorem is the same as that of Theorem 5.2. 


Remark 5.2. The same analysis could be used for successive overrelaxation type iteration. In that case,

$$\tilde E_l = (I - \beta\tilde P_k^l)(I - \beta\tilde P_k^{l-1})\cdots(I - \beta\tilde P_k^1),$$

where $\beta \in (0,2)$ is the relaxation parameter.


Remark 5.3. Many extensions and generalizations of the techniques given above are possible. These techniques lead to uniform estimates for multigrid iteration methods for solving nonsymmetric and/or indefinite problems in the following applications.

(1) Approximations using higher order nodal finite element spaces. 

(2) Three dimensional problems. 

(3) Problems with discontinuous coefficients as discussed in (ref. 6). 

(4) More general boundary conditions. 

(5) Problems with local mesh refinement as described in (ref. 11). 

(6) Finite element approximation of problems on domains with nonpolygonal boundaries as discussed in 
(ref. 6). 


In addition, the perturbation analysis given above can be combined with results for additive multilevel 
algorithms, for example, Theorem 3.1 of (ref. 6). This leads to new estimates for additive multilevel 
preconditioning iterations applied to indefinite and nonsymmetric problems. Provided that the coarse grid is 
sufficiently fine, the operator 

$$\mathcal{P} = \sum_{k=1}^{J}\tilde T_k$$

has a uniformly (independent of $J$) positive definite symmetric part with respect to the inner product $A(\cdot,\cdot)$ and
has a uniformly bounded operator norm. These results extend to all of the applications discussed in Remark 5.3. 


6. APPENDIX 


We provide a proof of Theorem 5.3 in this appendix. We will apply the analysis given in the proof of Theorem 3.2 of (ref. 6). Note that we cannot directly apply Theorem 3.2 of (ref. 6), since the smoother $R_k = \lambda_k^{-2}A_k$ does not satisfy (C.1). However, Theorem 5.3 will follow from the proof of Theorem 3.2 of (ref. 6) if we show that (C.2) holds, and that (3.5) and (3.6) of (ref. 6) hold with $T_k$ now denoting the operator defined above. Clearly, (C.2) holds with $\theta = 1$. The remaining two inequalities, corresponding to (3.5) and (3.6) of (ref. 6), are

$$A(T_kv,v) \le (C\eta^{k-l})^2A(v,v) \quad\text{for all } v \in M_l,\ l \le k, \tag{6.1}$$

and

$$A(v,v) \le C\sum_{k=1}^{J}A(T_kv,v) \quad\text{for all } v \in M_J. \tag{6.2}$$

Here $\eta$ is less than one and independent of $k$ and $l$.

From the definition of $\lambda_k$, we obviously have

$$A(T_kv,v) \le \lambda_k^{-1}A(A_kP_kv,v) = A(\hat T_kv,v).$$

As in (ref. ?), we have set $\hat T_k = \lambda_k^{-1}A_kP_k$. Inequality (6.1) follows from Lemma 4.2 of (ref. ?). Inequality (6.2) can be rewritten as

$$A(u,u) \le C\Big(A(P_1u,u) + \sum_{k=2}^{J}\lambda_k^{-2}A(A_kP_ku,\ A_kP_ku)\Big). \tag{6.3}$$


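To prove this we proceed as follows. Let $u \in M_J$ and $Q_0 = 0$. Then

$$\begin{aligned}
A(u,u) &= \sum_{k=1}^{J}A(u,(Q_k - Q_{k-1})u) \\
&\le \Big(A(P_1u,u) + \sum_{k=2}^{J}\lambda_k^{-2}A(A_kP_ku,\ A_kP_ku)\Big)^{1/2} \\
&\quad\times\Big(A(Q_1u,Q_1u) + \sum_{k=2}^{J}\lambda_k^{2}\big(A_k^{-1}(Q_k - Q_{k-1})u,\ (Q_k - Q_{k-1})u\big)_k\Big)^{1/2}.
\end{aligned} \tag{6.4}$$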


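Now, for $k > 1$,

$$\begin{aligned}
\big(A_k^{-1}(Q_k - Q_{k-1})u,\ (Q_k - Q_{k-1})u\big)_k &= \sup_{\phi\in M_k}\frac{\big(A_k^{-1/2}(Q_k - Q_{k-1})u,\ \phi\big)_k^2}{(\phi,\phi)_k} \\
&= \sup_{\psi\in M_k}\frac{\big((Q_k - Q_{k-1})u,\ (Q_k - Q_{k-1})\psi\big)_k^2}{A(\psi,\psi)}.
\end{aligned}$$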

By well-known approximation properties,

$$\big((Q_k - Q_{k-1})\psi,\ (Q_k - Q_{k-1})\psi\big)^{1/2} \le C\|(I - Q_{k-1})\psi\| \le Ch_kA(\psi,\psi)^{1/2}.$$

Combining the above estimates gives

$$\begin{aligned}
A(Q_1u,Q_1u) + \sum_{k=2}^{J}\lambda_k^{2}\big(A_k^{-1}(Q_k - Q_{k-1})u,\ (Q_k - Q_{k-1})u\big)_k
&\le C\Big(A(Q_1u,Q_1u) + \sum_{k=2}^{J}\lambda_k\|(Q_k - Q_{k-1})u\|^2\Big) \\
&\le CA(u,u).
\end{aligned} \tag{6.5}$$


The last inequality of (6.5) is (4.5) of (ref. ?) and can also be found in (ref. ?). Combining (6.4) and (6.5) proves (6.3), which completes the proof of the theorem.
SUMMARY

We develop a multilevel scheme for solving the balanced long transportation problem. That is, given a set $\{c_{kj}\}$ of shipping costs from a set of $M$ supply nodes $S_k$ to a set of $N$ demand nodes $D_j$, we seek a set of flows $\{x_{kj}\}$ that minimizes the total cost $\sum_{k=1}^{M}\sum_{j=1}^{N}x_{kj}c_{kj}$. We require that the problem be balanced, meaning that the total demand must equal the total supply. Solution techniques for this problem are well known from optimization and linear programming. We examine this problem, however, in order to develop principles that can then be applied to more intractable problems of optimization.

We develop a multigrid scheme for solving the problem, defining the grids, relaxation, and 
intergrid operators. Numerical experimentation shows that this line of research may prove fruitful. 
Further research directions are suggested. 

INTRODUCTION 

The transportation problem is the simplest of network flow problems. It is posed on a bipartite graph consisting of a set of $M$ supply nodes, a set of $N$ demand nodes, and a set of arcs connecting them. Each supply node $S_i$ has a fixed amount $s_i$ of a commodity which it can provide. Each demand node $D_j$ has a fixed requirement $d_j$ for that commodity. For each arc $(i,j)$ connecting supply node $S_i$ to demand node $D_j$, there is an associated cost per unit flow $c_{ij}$. When the total supply equals the total demand, the problem is balanced. When $M \ll N$, the problem is referred to as a long transportation problem. Denoting the flow on arc $(i,j)$ by $x_{ij}$, the transportation problem can be expressed as

$$\text{Minimize } \sum_{i=1}^{M}\sum_{j=1}^{N}c_{ij}x_{ij} \quad\text{subject to:}\quad \sum_{j=1}^{N}x_{ij} = s_i,\ i = 1,\ldots,M; \qquad \sum_{i=1}^{M}x_{ij} = d_j,\ j = 1,\ldots,N; \qquad x_{ij} \ge 0.$$

*This work was supported in part by the Naval Postgraduate School Research Council, Grant No. MAOOO-MA999/4476-4479.

Let $b$ denote an $(M+N)$-vector whose first $M$ entries are the available supplies $s_i$ at nodes $S_1$ through $S_M$ and whose last $N$ entries are (negatives of) the required demands $d_j$ at demand nodes $D_1$ through $D_N$. Let $K$ be the number of arcs in the problem. Throughout this work we shall assume that every supply node is connected to every demand node, so that $K = MN$. Let the $K$-vector $x$ be composed of the flows on the arcs from the $M$ supply nodes to the $N$ demand nodes in some order, and let the $K$-vector $c$ be the costs of shipping on those arcs in the same order. We denote by $A$ the incidence matrix of the graph, so that $A$ has as many rows as there are nodes in the problem ($M + N$) and as many columns as there are arcs ($MN$). Each column of $A$ is associated with one arc of the problem, and the columns are arranged in an order that matches the order of the vectors $c$ and $x$. Each column has exactly two nonzero entries: a $+1$ in the row corresponding to the tail (supply) node $S_i$ of the arc, and a $-1$ in the row corresponding to the head (demand) node $D_j$. Each row of $A$ is associated with one of the constraints of the problem [1]. Then the problem may be written in matrix notation as

Minimize: $c^Tx$

Subject to: $Ax = b$,

$x \ge 0$.

A simple example is presented in Figure 1. In the example, there are four supply nodes, having 12, 15, 10, and 7 units of the commodity to deliver. There are three demand nodes, requiring 13, 20, and 11 units of the commodity. We seek a flow vector

$$x = (x_{11}\ x_{12}\ x_{13}\ x_{21}\ x_{22}\ x_{23}\ x_{31}\ x_{32}\ x_{33}\ x_{41}\ x_{42}\ x_{43})^T,$$

given that the vector of costs, written in corresponding order, is

$$c = (2\ 1\ 5\ 6\ 4\ 3\ 1\ 7\ 4\ 2\ 3\ 4)^T.$$

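The algebraic description of this problem is to find $x$ such that $c^Tx$ is minimized, subject to the system of constraints

$$\begin{pmatrix}
1&1&1&0&0&0&0&0&0&0&0&0\\
0&0&0&1&1&1&0&0&0&0&0&0\\
0&0&0&0&0&0&1&1&1&0&0&0\\
0&0&0&0&0&0&0&0&0&1&1&1\\
-1&0&0&-1&0&0&-1&0&0&-1&0&0\\
0&-1&0&0&-1&0&0&-1&0&0&-1&0\\
0&0&-1&0&0&-1&0&0&-1&0&0&-1
\end{pmatrix}
\begin{pmatrix}x_{11}\\x_{12}\\x_{13}\\x_{21}\\x_{22}\\x_{23}\\x_{31}\\x_{32}\\x_{33}\\x_{41}\\x_{42}\\x_{43}\end{pmatrix}
=
\begin{pmatrix}12\\15\\10\\7\\-13\\-20\\-11\end{pmatrix}.$$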
Figure 1: A simple example of a transportation problem

Very little work has been done on multigrid methods for discrete optimization problems. 
Significant studies to date are [2], [3], [4], and [5]. The traditional optimization algorithm which 
most closely resembles a multilevel algorithm is aggregation/disaggregation [6], [7], and [8], in 
which nodes are aggregated in a logical way in order to reduce the size of the problem, and the 
solution to the smaller problem is disaggregated to provide an initial estimate for the solution to 
the original problem. The most successful work to date, and the work that inspired this study, is
that of Kaminsky [4].


COST-SPACE 

In [4] it is required that the demand nodes occupy a physical location in space, and that a 
relationship exist between transportation costs and distances. This is done so that the coarsening 
step may be performed by aggregating together demand nodes that are physically near one another. 
For this to make sense, it is necessary that shipment to each of the aggregated demand nodes 
involve a similar cost, which naturally occurs if the shipping cost is a function of distance. For 
many applications this makes perfect sense; the cost of shipping a commodity is often directly 
linked to the distance the commodity must be shipped. This restriction is overly limiting for other 
types of problems, however. For example, the manpower assignment problem, in which a specified
number of jobs must be assigned to a given set of workers, can be formulated as a transportation
problem. There is no distance involved in such a problem, and cost of assignment is related to other 
factors, such as the cost of training an individual for a specific task. 

In order to address problems that have no geometrical dependence of cost on distance, we 
employ a change of coordinate systems from physical space to a space we describe as cost space. 

For the $M \times N$ problem, cost space is the $M$-dimensional space in which each coordinate axis is the cost of shipping from one of the $M$ supply nodes. Each of the $N$ demand nodes is placed in cost space at the point whose coordinates are the unit costs of shipping from the supply nodes to it. For example, the three demand nodes in Figure 1 would be placed in a four-dimensional cost space, with coordinates $D_1 = (2,6,1,2)$, $D_2 = (1,4,7,3)$, and $D_3 = (5,3,4,4)$. This change of coordinate systems makes shipping cost the metric of the problem. Two demand nodes are therefore “near” each other only if their shipping costs are similar, and aggregating neighboring demand nodes automatically ensures that their costs are similar.
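Placing demand nodes in cost space amounts to reading the cost vector column by column. The minimal sketch below assumes the supply-major ordering of the Figure 1 cost vector ($x_{11}, x_{12}, \ldots, x_{43}$). The function name `cost_space_coordinates` is ours, introduced only for illustration.

```python
def cost_space_coordinates(c, M, N):
    """Return one M-dimensional point per demand node D_j.

    Coordinate k of D_j is the unit cost of shipping from supply node
    S_{k+1} to D_j, i.e. entry k*N + j of the supply-major cost vector c.
    """
    return [tuple(c[k * N + j] for k in range(M)) for j in range(N)]


# Figure 1 costs, in the order x11, x12, x13, x21, ..., x43.
c = [2, 1, 5, 6, 4, 3, 1, 7, 4, 2, 3, 4]
coords = cost_space_coordinates(c, M=4, N=3)
print(coords)  # [(2, 6, 1, 2), (1, 4, 7, 3), (5, 3, 4, 4)]
```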

Posed in cost space, the dimensionality of the problem equals the number of supply nodes. In traditional multigrid methods, one typically uses grids that are tensor products of one-dimensional grids, each having a number of gridpoints that is a power of two. In the cost space approach this would lead to very rapid growth in the size of the problem, since a tensor-product grid with $2^p$ points per axis has $2^{pM}$ points in all. For this reason, the cost space approach can be applied only to problems with a relatively small number of supply nodes. This is one reason for restricting our attention to the long transportation problem.

Reduced dimension cost space 


If at least one supply node is connected to all demand nodes (and in our work we assume this to be true of all supply nodes), then we can transform the $M \times N$ transportation problem to an equivalent $(M-1) \times N$ problem, which we call the reduced dimension problem. Since we are dealing with the long problem, the transformed problem is somewhat simpler and less expensive to solve. The transformation is accomplished as follows. Suppose that supply node $S_i$ is connected to all demand nodes. Then, for each demand node $D_j$, we subtract $c_{ij}$ (the cost of shipping from supply node $S_i$ to $D_j$) from all of the shipping costs into demand node $D_j$. That is, we form an auxiliary cost vector with entries $\hat c_{kj} = c_{kj} - c_{ij}$. As a result, every demand node has coordinate zero along the $S_i$ axis of cost space. Effectively, $S_i$ has been removed from the problem, leaving an $(M-1) \times N$ problem to be solved. For example, if we use the cost of shipping from $S_2$ on the example in Figure 1, the transformed cost vector becomes

$$\hat c = (-4\ -3\ 2\ 0\ 0\ 0\ -5\ 3\ 1\ -4\ -1\ 1)^T.$$
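The reduction is a single subtraction per arc. The sketch below assumes the same supply-major storage as the Figure 1 cost vector. `reduce_costs` is a hypothetical helper name. With the Figure 1 costs, the final entry works out to $c_{43} - c_{23} = 4 - 3 = 1$.

```python
def reduce_costs(c, M, N, i):
    """Return c_hat with c_hat[k*N + j] = c[k*N + j] - c[i*N + j].

    c is stored supply-major (x11, ..., x1N, x21, ...); i is the 0-based
    index of the supply node whose costs are subtracted from every column.
    """
    return [c[k * N + j] - c[i * N + j] for k in range(M) for j in range(N)]


c = [2, 1, 5, 6, 4, 3, 1, 7, 4, 2, 3, 4]   # Figure 1 costs
c_hat = reduce_costs(c, M=4, N=3, i=1)      # i = 1 is S_2 in the text's numbering
print(c_hat)  # [-4, -3, 2, 0, 0, 0, -5, 3, 1, -4, -1, 1]
```

The row belonging to the chosen supply node becomes all zeros. This is the sense in which that node drops out of the cost-space description.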

We can show that, although the objective function value is different for the new problem, a solution of one is also a solution of the other.

Theorem 1. Let the $M \times N$ balanced long transportation problem be represented by a bipartite graph $G$, and suppose that supply node $S_i$ is connected to all demand nodes. Let $b$ be the column vector of length $M + N$ whose first $M$ entries are the supplies at the supply nodes and whose remaining $N$ entries are the negatives of the demands at the demand nodes. Let $A$ be the incidence matrix of the graph $G$; that is, for each arc $(k,j)$ we have $A(k,(k-1)N+j) = 1$ and $A(M+j,(k-1)N+j) = -1$. Let $c$ be the vector of length $MN$ whose $((k-1)N+j)$th element is the cost $c_{kj}$ of shipping from node $S_k$ to node $D_j$ along arc $(k,j)$. Define $\hat c$ to be the vector whose $((k-1)N+j)$th entry is $\hat c_{kj} = c_{kj} - c_{ij}$. Then $x^*$ is a solution to the problem

Minimize: $c^Tx$

Subject to: $Ax = b$,

$x \ge 0$,

if and only if it is a solution to the problem

Minimize: $\hat c^Tx$

Subject to: $Ax = b$,

$x \ge 0$.
Proof:

$$\begin{aligned}
\sum_{k=1}^{M}\sum_{j=1}^{N}\hat c_{kj}x_{kj} &= \sum_{k=1}^{M}\sum_{j=1}^{N}(c_{kj} - c_{ij})x_{kj} \\
&= \sum_{k=1}^{M}\sum_{j=1}^{N}(c_{kj}x_{kj} - c_{ij}x_{kj}) \\
&= \sum_{k=1}^{M}\sum_{j=1}^{N}c_{kj}x_{kj} - \sum_{k=1}^{M}\sum_{j=1}^{N}c_{ij}x_{kj} \\
&= c^Tx - \sum_{j=1}^{N}c_{ij}\sum_{k=1}^{M}x_{kj} = c^Tx - \sum_{j=1}^{N}c_{ij}d_j,
\end{aligned}$$

since $\sum_{k=1}^{M}x_{kj} = d_j$ in the balanced problem. But $\sum_{j=1}^{N}c_{ij}d_j$ does not depend on $x$, and therefore $\hat c^Tx$ achieves its extreme values precisely when $c^Tx$ does. ∎
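The proof rests on the identity $\hat c^Tx = c^Tx - \sum_j c_{ij}d_j$, which holds for every feasible $x$. The sketch below checks it numerically on the Figure 1 data. The feasible flow comes from the northwest-corner rule, a standard starting rule that we use here only as a convenient way to produce some feasible $x$; it is not part of the authors' method. `northwest_corner` is an illustrative name.

```python
def northwest_corner(supply, demand):
    """Return some feasible flow for a balanced problem (northwest-corner rule)."""
    s, d = list(supply), list(demand)
    x = [[0] * len(d) for _ in s]
    k = j = 0
    while k < len(s) and j < len(d):
        q = min(s[k], d[j])
        x[k][j] = q
        s[k] -= q
        d[j] -= q
        if s[k] == 0:
            k += 1      # supply node exhausted: move down
        else:
            j += 1      # demand node satisfied: move right
    return x


supply, demand = [12, 15, 10, 7], [13, 20, 11]
c = [2, 1, 5, 6, 4, 3, 1, 7, 4, 2, 3, 4]   # Figure 1 costs, supply-major order
M, N, i = len(supply), len(demand), 1      # i = 1 is S_2 (0-based index)

x = northwest_corner(supply, demand)
flat = [x[k][j] for k in range(M) for j in range(N)]
c_hat = [c[k * N + j] - c[i * N + j] for k in range(M) for j in range(N)]

lhs = sum(a * b for a, b in zip(c_hat, flat))                 # c_hat^T x
rhs = (sum(a * b for a, b in zip(c, flat))                    # c^T x ...
       - sum(c[i * N + j] * demand[j] for j in range(N)))     # ... minus sum_j c_ij d_j
print(lhs, rhs)  # -19 -19
```

Because the offset $\sum_j c_{ij}d_j$ (here 191) is the same for every feasible flow, both objectives rank all feasible flows identically.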

This transformation of the costs to reduced-dimension space sets all costs of shipping from $S_i$ to zero. As will be shown in the next section, our algorithm requires that the demand nodes be sorted once according to the cost of shipping. Since sorting is a fairly expensive operation, the savings generated by reducing the dimension of the problem are tangible. Once the transformation to reduced-dimension cost space has been performed, the resulting problem may be solved with no further consideration of the transformation. Therefore, in the remainder of this work, an $M \times N$ problem to be solved may be the reduced dimension version of a problem that was originally $(M+1) \times N$.

A MULTIGRID APPROACH TO THE TRANSPORTATION PROBLEM 



Following traditional multigrid design approaches, we develop the necessary tools to devise a multigrid V-cycle, which we combine with nested iteration to create an FMG algorithm. In particular, we must devise restriction and prolongation methods and some form of local relaxation, and then weave them into an algorithm.


Restriction 


To devise a restriction algorithm, it is first necessary to define a coarse grid. We use an approach in which each gridpoint on the coarse grid is a demand node for the coarse grid problem and represents a pair of demand nodes on the fine grid. This is accomplished as follows. The demand nodes are first sorted by increasing cost of shipping from $S_1$ and divided into two groups about the median of the sorted cost. This procedure results in two groups of demand nodes: one with a lower cost of shipping from $S_1$, and one for which shipping from $S_1$ is more expensive. Each of these groups is then sorted according to increasing cost of shipping from $S_2$ and divided into two groups about the median cost. This results in four groups: one for which shipping is expensive from both supply nodes, one for which shipping is inexpensive from both supply nodes, one for which shipping is expensive from $S_2$ and inexpensive from $S_1$, and one for which shipping is expensive from $S_1$ and inexpensive from $S_2$.
Figure 2: A simple example of a coarsening process for the transportation problem


If there are more than two supply nodes in the problem, this process is continued. The four groups are each sorted by cost of shipping from supply node $S_3$, then divided into smaller groups if necessary and sorted again according to cost from $S_4$, and so on. If the groups still contain more than two nodes after the nodes have been sorted according to cost from all supply nodes, the sorting process begins again with cost from $S_1$ on each of the groups. Eventually, the nodes will be sorted into pairs that have similar shipping costs from all supply nodes. Each of these pairs of demand nodes is then replaced with a single coarse grid demand node, and the collection of these nodes constitutes the first coarse grid.
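The pairing procedure is a recursive median split that cycles through the cost-space axes. The sketch below assumes the number of demand nodes is a power of two. It breaks ties by the original node order, which follows from Python's stable sort; the text does not specify a tie-breaking rule. It builds a single level of the hierarchy only. `pair_demand_nodes` is a hypothetical name.

```python
def pair_demand_nodes(coords):
    """Pair demand nodes by repeated median splits in cost space.

    coords[j] is the cost-space point of demand node j (one coordinate per
    supply node). A group is sorted on axis 0 (cost from S_1) and split at the
    median, each half is sorted on axis 1, and so on, cycling back to axis 0
    after the last axis, until every group holds at most two nodes.
    """
    m = len(coords[0])
    pairs = []

    def split(group, axis):
        if len(group) <= 2:
            pairs.append(tuple(group))
            return
        group = sorted(group, key=lambda j: coords[j][axis])  # stable sort
        half = len(group) // 2
        nxt = (axis + 1) % m
        split(group[:half], nxt)
        split(group[half:], nxt)

    split(list(range(len(coords))), 0)
    return pairs


coords = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
print(pair_demand_nodes(coords))  # [(0, 2), (1, 3), (4, 6), (5, 7)]
```

Each resulting pair would become one coarse demand node. Coarser levels would repeat the split using the coarse nodes' restricted costs as their coordinates.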

Further coarsening is accomplished by repeating the procedure described above on the coarse 
grids to produce still coarser grids. Figure 2 shows a simple example of the coarsening process. If 
the number of points on the original grid is a power of two, then in the limit a coarsest grid would 
consist of a single demand node. As in traditional multigrid methods, once the hierarchy of grids is 
established it is stored, so that the sorting process need never be repeated. 

Three quantities must be restricted when aggregating a pair of fine grid demand nodes into a coarse grid demand node: the demands, flows, and costs. Let $D_m^{2h}$ be the coarse grid node representing the fine grid nodes $D_j^h$ and $D_l^h$. It seems natural to restrict the demands simply by summing the demands of the two fine grid nodes to produce the demand at the coarse grid node, $d_m^{2h} = I_h^{2h}[d_j^h, d_l^h] = d_j^h + d_l^h$. Similarly, the flow $x_{km}^{2h}$ from any supply node $S_k$ into the coarse demand node should be the sum of the flows from $S_k$ to each of the fine grid demand nodes that make up the coarse grid node, $x_{km}^{2h} = I_h^{2h}[x_{kj}^h, x_{kl}^h] = x_{kj}^h + x_{kl}^h$.

Restricting the cost of shipment is more complicated, and no obvious “best” approach is apparent. However, several methods can be considered. The simplest of these is to define the coarse cost $c_{km}^{2h}$ to be the minimum of the fine costs, i.e., $c_{km}^{2h} = I_h^{2h}[c_{kj}^h, c_{kl}^h] = \min(c_{kj}^h, c_{kl}^h)$. Other simple schemes are readily devised, such as using the maximum of the fine costs or a weighted average of the fine grid costs. We use a weighted average of the fine grid costs. Again, there are several possible weightings, each having valid arguments for and against it. Three schemes were tested in some depth: equal weighting, flow weighting, and demand weighting.

Equal weighting:

$$c_{km}^{2h} = \frac{c_{kj}^h + c_{kl}^h}{2}$$

Demand weighting:

$$c_{km}^{2h} = \frac{d_j^h c_{kj}^h + d_l^h c_{kl}^h}{d_j^h + d_l^h}$$

Flow weighting:

$$c_{km}^{2h} = \frac{x_{kj}^h c_{kj}^h + x_{kl}^h c_{kl}^h}{x_{kj}^h + x_{kl}^h}$$


With flow weighting, provision must be made for the case where there is zero flow on both arcs. In 
such a case flow weighting can be replaced with either demand weighting or equal weighting. In 
general, we found that demand weighting most consistently gave the best results, and adopted it for 
our algorithm. 
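The three restriction formulas, including the zero-flow fallback, can be collected in one helper. The sketch below uses demand weighting as that fallback, which is one of the two options named above, and assumes $d_j^h + d_l^h > 0$. The name `restrict_pair` and its argument layout are illustrative, not the authors' code.

```python
def restrict_pair(d_j, d_l, x_j, x_l, c_j, c_l, weighting="demand"):
    """Aggregate fine demand nodes D_j and D_l into one coarse node D_m.

    x_j[k], c_j[k]: flow and unit cost from supply node S_k into D_j
    (likewise x_l, c_l for D_l). Returns (coarse demand, flows, costs).
    """
    d_m = d_j + d_l                            # demands are summed
    x_m = [a + b for a, b in zip(x_j, x_l)]    # flows are summed
    c_m = []
    for k in range(len(c_j)):
        if weighting == "equal":
            c_m.append((c_j[k] + c_l[k]) / 2)
        elif weighting == "flow" and x_j[k] + x_l[k] > 0:
            c_m.append((x_j[k] * c_j[k] + x_l[k] * c_l[k]) / (x_j[k] + x_l[k]))
        else:
            # Demand weighting; also the fallback for "flow" when both arcs
            # carry zero flow.
            c_m.append((d_j * c_j[k] + d_l * c_l[k]) / d_m)
    return d_m, x_m, c_m


print(restrict_pair(10, 30, [10, 0], [0, 0], [2, 4], [6, 8]))
# (40, [10, 0], [5.0, 7.0])
```

With `weighting="flow"`, the same call returns costs `[2.0, 7.0]`. The first supply node has flow only into $D_j$, so its coarse cost is $c_{kj}^h$. The second has no flow on either arc and falls back to demand weighting.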


Prolongation, or Interpolation 


Suppose that the problem has been solved on the coarse grid $\Omega^{2h}$. We seek a method of prolongation, that is, a way in which the coarse grid solution can be interpolated onto the fine grid. In the coarse grid solution there is some quantity of flow $x_{km}^{2h}$ from each supply node $S_k$ to each coarse grid demand node $D_m^{2h}$. Each such demand node on the coarse grid, however, represents the aggregation of two demand nodes on the fine grid, $D_j^h$ and $D_l^h$. An interpolation of the coarse grid solution can therefore be constructed by treating the $M$ flows into the coarse grid demand node $D_m^{2h}$ as supplies. Interpolation, then, consists of solving, for each coarse grid demand node, the $M \times 2$ transportation problem with those supply values, the two demand nodes $D_j^h$ and $D_l^h$ with demands $d_j^h$ and $d_l^h$, and the shipping costs $c_{kj}^h$ and $c_{kl}^h$, $k = 1, 2, \ldots, M$. (Figure 4 shows schematically how the interpolation process appears.)

Having defined the interpolation process as finding the solutions to many small transportation 
problems, we turn our attention to the mechanism for finding these solutions. A method for solving 
such M x 2 problems is described below. The method is a special case of Vogel’s approximation 
method. 

Algorithm 1 Solving the $M \times 2$ Balanced Transportation Problem

1. For each supply node $S_k$, find the difference $\delta_k = |c_{k1} - c_{k2}|$ between the costs of shipping to the two fine grid demand nodes, here labeled $D_1$ and $D_2$.

2. Rank the $M$ supply nodes in decreasing order of these cost differentials, so that $\delta_1 \ge \delta_2 \ge \ldots \ge \delta_M$.

3. Repeat until all supply nodes are removed from the problem:

(a) Denote the supply node at the top of the ordered list as the “current” supply node, and allocate flow to the demand node with the lower cost of shipment, that is, along the less expensive arc, thus determining a “current” demand node. (If more than one node has the largest differential cost, select from among them the node with the smallest cost along one of its two arcs.) Allocate flow along this arc until either the demand at the current demand node is satisfied or the supply at the current supply node is exhausted.

(b) If the supply at the current supply node is exhausted, remove that supply node from the problem.

(c) If the demand at the current demand node is satisfied, remove that demand node from the problem, allocate the remaining supply from the current supply node to the remaining demand node, and remove the current supply node from the problem.

4. Stop.

Figure 3: Example problem illustrating the solution method for an $M \times 2$ problem.
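Algorithm 1 can be sketched in Python as follows. This is a minimal version assuming a balanced problem with costs given as `(c_k1, c_k2)` pairs. The names `solve_m_by_2`, `objective`, and `brute_force_min` are ours. The exhaustive search is included only to check optimality (Theorem 2) on a tiny instance. With integer data, an integral optimal flow exists, so searching over integer flows is enough.

```python
from itertools import product


def solve_m_by_2(supply, demand, cost):
    """Greedy solver for a balanced M x 2 transportation problem (Algorithm 1).

    supply: M supplies; demand: (d1, d2); cost: M pairs (c_k1, c_k2).
    Returns x with x[k] = [x_k1, x_k2]. Assumes sum(supply) == sum(demand).
    """
    M = len(supply)
    # Steps 1-2: decreasing cost differential; ties go to the node with the
    # smallest cost along one of its two arcs.
    order = sorted(range(M),
                   key=lambda k: (-abs(cost[k][0] - cost[k][1]), min(cost[k])))
    remaining = list(demand)
    x = [[0, 0] for _ in range(M)]
    for k in order:
        s = supply[k]
        # Step 3(a): use the cheaper arc, unless that demand node was removed.
        cheap = 0 if cost[k][0] <= cost[k][1] else 1
        if remaining[cheap] == 0:
            cheap = 1 - cheap
        amount = min(s, remaining[cheap])
        x[k][cheap] += amount
        remaining[cheap] -= amount
        s -= amount
        # Step 3(c): current demand met, so the rest goes to the other node.
        if s > 0:
            x[k][1 - cheap] += s
            remaining[1 - cheap] -= s
    return x


def objective(x, cost):
    return sum(a * c1 + b * c2 for (a, b), (c1, c2) in zip(x, cost))


def brute_force_min(supply, demand, cost):
    """Exhaustive search over integer flows; only for tiny problems."""
    best = None
    for xs in product(*(range(s + 1) for s in supply)):
        if sum(xs) == demand[0]:
            z = sum(c1 * a + c2 * (s - a) for (c1, c2), a, s in zip(cost, xs, supply))
            best = z if best is None else min(best, z)
    return best


supply, demand = [3, 4, 2], (5, 4)
cost = [(1, 4), (2, 3), (5, 1)]
x = solve_m_by_2(supply, demand, cost)
print(x, objective(x, cost), brute_force_min(supply, demand, cost))
# [[3, 0], [2, 2], [0, 2]] 15 15
```

During interpolation, this routine would be called once per coarse demand node. The coarse flows into that node serve as `supply`, and the two fine demands serve as `demand`.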

As an example of this procedure, consider the five-by-two problem shown in Figure 3. The five supply nodes $S_1, S_2, \ldots, S_5$ have, respectively, 15, 12, 16, 18, and 14 units of the commodity to deliver. The demands of the two demand nodes $D_1$ and $D_2$ are 30 and 45. Let $\delta = (4\ 8\ 4\ 6\ 1)^T$ be the vector whose $i$th entry is the difference $\delta_i$ between the shipping costs from supply node $S_i$ to the two demand nodes (the costs themselves are given for each arc in the figure). Sorting from largest to smallest value of $\delta_i$, the supply nodes are ordered $(S_2, S_4, S_1, S_3, S_5)$. Note that, while the differences for nodes $S_1$ and $S_3$ are the same, the cost $c_{12}$ along the arc from node $S_1$ to node $D_2$ is less than the cost of either of the arcs incident from node $S_3$. Starting with node $S_2$, then, as much flow as possible is sent along the least expensive arc. In this case, that is the arc to demand node $D_2$. Since this demand exceeds the available supply from node $S_2$, all of the flow from node $S_2$ goes along this arc. Similarly, node $S_4$ and then node $S_1$ send all of their supply to node $D_2$. When node $S_3$ has sent 12 units of flow along its least expensive arc, the demand at node $D_1$ is completely met. Thus node $S_3$ sends its remaining units to node $D_2$, as does node $S_5$. Although the arc from node $S_3$ to $D_1$ is less expensive, the demand at $D_1$ has been met from supply nodes where the difference in arc costs is greater.

We now show that, because of the special structure of the $M \times 2$ problem (namely, that there are only two demand nodes), this procedure produces an optimal solution.
Theorem 2. Let $x$ be the vector of flows assigned for the $M \times 2$ problem using the algorithm given above. Then $x$ is an optimal solution to the $M \times 2$ problem.

Proof: Suppose that $x$ is not an optimal solution. Then there exists a flow $x^* \ne x$ such that $z^* = c^Tx^*$ is optimal. We will show that if $x$ is determined by the algorithm given above and $z = c^Tx$, then $z^* \ge z$, contradicting the assumption that $x$ is not an optimal solution. Letting $\delta_k = |c_{k1} - c_{k2}|$ for each $k = 1, 2, \ldots, M$, assume the supply nodes have been ordered in decreasing order of $\delta_k$, so that $\delta_1 \ge \delta_2 \ge \ldots \ge \delta_M$. Let $i$ be the first supply node for which $x^*$ differs from $x$, and, without loss of generality, assume that $c_{i1} < c_{i2}$. Let $\Delta = x_{i1} - x^*_{i1}$. We first observe five useful facts:

1. Since the problem is balanced, the total flow out of $S_k$ equals the supply, so that $s_k = x_{k1} + x_{k2} = x^*_{k1} + x^*_{k2}$ for every $k$, implying $x_{k1} - x^*_{k1} = x^*_{k2} - x_{k2}$.

2. In particular, since $\Delta = x_{i1} - x^*_{i1}$, we have $-\Delta = x_{i2} - x^*_{i2}$.

3. Since the problem is balanced, the total flow into $D_1$ equals the demand, so that $d_1 = \sum_{j=1}^{M}x_{j1}$ and $d_1 = \sum_{j=1}^{M}x^*_{j1}$. Subtracting these two relations, and noting that $x$ and $x^*$ do not differ for $j = 1, 2, \ldots, i-1$, we find that $0 = x_{i1} - x^*_{i1} + \sum_{j=i+1}^{M}(x_{j1} - x^*_{j1})$, implying that $\Delta = \sum_{j=i+1}^{M}(x^*_{j1} - x_{j1})$.

4. Using similar reasoning, we obtain $-\Delta = \sum_{j=i+1}^{M}(x^*_{j2} - x_{j2})$.

5. By construction, since $c_{i1} < c_{i2}$, $x_{i1}$ is as large as it can possibly be. Hence, if $x$ and $x^*$ differ for node $i$, then $x_{i1} > x^*_{i1}$, implying that $\Delta > 0$.

Next, we observe that

$$\begin{aligned}
z^* &= z + z^* - z \\
&= z + c^Tx^* - c^Tx \\
&= z + \sum_{j=i}^{M}(x^*_{j1}c_{j1} + x^*_{j2}c_{j2}) - \sum_{j=i}^{M}(x_{j1}c_{j1} + x_{j2}c_{j2}),
\end{aligned}$$

where we have used the fact that $x^*$ and $x$ do not differ for $j < i$. Separating the flows from supply node $i$, we can write

$$\begin{aligned}
z^* &= z + (x^*_{i1} - x_{i1})c_{i1} + (x^*_{i2} - x_{i2})c_{i2} + \sum_{j=i+1}^{M}(x^*_{j1} - x_{j1})c_{j1} + \sum_{j=i+1}^{M}(x^*_{j2} - x_{j2})c_{j2} \\
&= z - \Delta c_{i1} + \Delta c_{i2} + \sum_{j=i+1}^{M}(x^*_{j1} - x_{j1})(c_{j1} - c_{j2}),
\end{aligned}$$

where we have used the fact that $x_{k1} - x^*_{k1} = x^*_{k2} - x_{k2}$ for all $k$. Recalling that $\delta_k = |c_{k1} - c_{k2}|$, and that $c_{i1} < c_{i2}$ implies $\delta_i = c_{i2} - c_{i1}$, we observe that

$$z^* \ge z + \Delta\delta_i + \sum_{j=i+1}^{M}(x^*_{j1} - x_{j1})(-\delta_j).$$

i=t+i 
Figure 4: An illustration of interpolation by local optimization. The flows in the 3 x 2 coarse grid problem are the supplies for the two 3 x 2 local problems. The combination of the flows solving those two local problems makes up the interpolated solution to the 3 x 4 fine grid problem.

Since the nodes are ordered in decreasing order of $\delta_k$, we know that $-\delta_i \le -\delta_j$ for all $j > i$, and therefore

$$z^* \ge z + \Delta\delta_i - \delta_i\sum_{j=i+1}^{M}(x^*_{j1} - x_{j1}).$$

Finally, recalling that $\Delta = \sum_{j=i+1}^{M}(x^*_{j1} - x_{j1})$, we obtain

$$z^* \ge z + \Delta\delta_i - \delta_i\Delta.$$

Therefore $z^* \ge z$, contradicting the assumption that $x$ is not an optimal solution.

∎

Relaxation 

Suppose that we have solved the coarse grid problem, and have interpolated that solution by 
solving an M x 2 transportation problem for each coarse grid demand node in order to pass the 
local solution to the two fine grid demand nodes represented by each coarse grid node. The supplies 
for this local M x 2 problem are the coarse grid flows. Figure 4 displays a schematic showing how 
the interpolation process appears graphically. 

It is important to note that, while each of the local $M \times 2$ problems has been solved optimally, there is no reason to expect that the total set of fine grid flows thus assigned will be optimal. For this reason, it is essential that we devise some kind of “relaxation” scheme, whose task is to smooth or correct errors left by the interpolation scheme.
When two $M \times 2$ subproblem solutions are viewed from a more global perspective, as a solution to an $M \times 4$ problem (or as a portion of a solution to a still larger problem), this combination of locally optimized solutions may be flawed, in that too many arcs may have flow on them. This is because the minimum value of the objective function for a balanced transportation problem can always be obtained with a flow regime having flow on at most $M + N - 1$ arcs. This reflects the fact from linear programming theory that an extreme point solution has flow on $M + N - 1$ arcs if the solution is nondegenerate [9], and that an optimal solution can always be found at one of the extreme points. (A degenerate solution is one in which distinct subsets of demand nodes are supplied by distinct subsets of supply nodes.) If the solution is degenerate, there will be fewer arcs with flow on them. For example, if $N > M$, then in the extreme degenerate case each supply node provides flow to a disjoint subset of the demand nodes. This means that each demand node has exactly one arc with flow into it, so that precisely $N$ arcs have flow. If $N < M$, then the extreme degenerate case is when each supply node has exactly one arc with flow, giving $M$ such arcs.

When interpolating from $\Omega^{2h}$ to $\Omega^h$, each coarse demand node generates two fine grid demand nodes, and the optimal solution to the $M \times 2$ subproblem has $M + 1$ arcs with flow in the nondegenerate case. If the subproblem solution is degenerate, then $M$ arcs have flow. If there are $N/2$ demand nodes on $\Omega^{2h}$, then after the interpolation the collection of subproblem solutions (viewed as the initial feasible solution to the problem) will have flow on at least $NM/2$ and at most $(NM + N)/2$ arcs, depending on how many subproblems are degenerate.

Thus, whenever $NM/2$ is greater than $M + N - 1$ (which holds exactly when $(M-2)(N-2) > 2$, for example for any long transportation problem with $M \ge 3$ and $N \ge 5$), the collection of local solutions has too many arcs with flow to be an extreme point solution to the fine grid problem, and it is probably less than optimal. The local relaxation scheme developed here is designed to reduce the number of arcs with flow for the fine grid problem, which will generally have the effect of moving the global solution toward an optimal solution.

The mechanism by which we do this is cycle removal. A spanning tree over $M + N$ nodes has $M + N - 1$ arcs, and adding one or more arcs to a tree results in a graph with at least one cycle. Consequently, for most problems the interpolation process will introduce cycles. Although this argument has been developed in the setting of the entire collection of local solutions, it also holds in a pairwise sense. That is, each of two $M \times 2$ local solutions will have either $M + 1$ or $M$ arcs with flow. Viewing the pair as a solution to an $M \times 4$ problem, we observe that the combined solution will have at least $2M$ arcs with flow. If $M > 2$, this equals or exceeds the $M + 3$ arcs with flow that would be present in an extreme point solution.

To illustrate this, consider the possibilities when two $3 \times 2$ local solutions are combined into a $3 \times 4$ solution, as shown in Figure 5. In a), two nondegenerate solutions are combined. Numbering the demand nodes of the combined problem clockwise from the upper left and the supply nodes from top to bottom, we observe that there are three cycles in the combined solution: $(S_1, D_3, S_2, D_1, S_1)$, $(S_1, D_3, S_3, D_4, S_2, D_1, S_1)$, and $(S_2, D_3, S_3, D_4, S_2)$. In b), a degenerate solution is combined with a nondegenerate solution, yielding a combined solution with one cycle. In c), two degenerate solutions are combined into a solution that has no cycles, while in d), two degenerate solutions are combined into a solution that has one cycle.

A reasonable candidate for a local relaxation process is to adjust the flow in the initial solution produced by the interpolation process so that cycles are removed and the objective function is reduced. The effect of this procedure is to adjust the locally optimal flows that result from interpolation so that they are more nearly optimal for the global problem.

Figure 5: Combining two 3 x 2 solutions into a 3 x 4 solution. Four typical cases are shown.

Cycles are detected in the algorithm using a depth first search (DFS). The DFS proceeds as follows:

Algorithm 2: Depth First Search

1. Initialize all nodes with DFS number 0, to indicate that they have not yet been visited.

2. Start at any node. Assign this node a DFS number of 1, and define node 0 to be the predecessor of this node.

3. If any node adjacent to the current node has been previously visited and has a DFS number lower than that of the predecessor of the current node, then the path from that node down to the current node, closed by the arc back to it, is a cycle. Stop the DFS and call the cycle removal routine.

4. If no such node exists, look for adjacent nodes that have not been visited. If there is an unvisited adjacent node, identify the current node as its predecessor, make the unvisited node the current node, assign it a DFS number equal to the DFS number of its predecessor plus 1, and return to step 3.

5. If there are no unvisited nodes adjacent to the current node, make the predecessor of the current node the current node. If the current node is node 0, stop. Otherwise, return to step 3.
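Algorithm 2 can be sketched in Python as follows. This is an illustrative sketch, not the authors' code. It assumes an undirected simple graph given as adjacency lists with every arc listed at both endpoints, and it uses `None` in place of "node 0". Because DFS numbers are depths, a visited neighbour numbered below the predecessor must be an ancestor other than the predecessor, so walking back along predecessors recovers the cycle.

```python
from typing import Dict, Hashable, List, Optional


def find_cycle(adj: Dict[Hashable, List[Hashable]],
               start: Hashable) -> Optional[List[Hashable]]:
    """Algorithm 2 on an undirected simple graph. DFS numbers are depths:
    0 means unvisited and the start node gets 1. Returns a closed node list
    such as [a, b, c, a], or None if the component of start has no cycle."""
    dfs = {v: 0 for v in adj}                                  # step 1
    pred: Dict[Hashable, Optional[Hashable]] = {start: None}   # None = "node 0"
    dfs[start] = 1                                             # step 2
    current: Optional[Hashable] = start
    while current is not None:
        p = pred[current]
        p_num = dfs[p] if p is not None else 0
        # step 3: a visited neighbour numbered below the predecessor closes a cycle
        for v in adj[current]:
            if 0 < dfs[v] < p_num:
                path = [current]
                while path[-1] != v:          # climb the DFS tree up to v
                    path.append(pred[path[-1]])
                return path + [current]
        # step 4: descend to an unvisited neighbour, if any
        nxt = next((v for v in adj[current] if dfs[v] == 0), None)
        if nxt is not None:
            pred[nxt] = current
            dfs[nxt] = dfs[current] + 1
            current = nxt
        else:
            current = p                       # step 5: back up; None ends the search
    return None


# A small hypothetical graph with one cycle S1-D1-S2-D3-S1.
adj = {"S1": ["D1", "D3"], "D1": ["S1", "S2"],
       "S2": ["D1", "D3"], "D3": ["S2", "S1"]}
print(find_cycle(adj, "S1"))  # ['D3', 'S2', 'D1', 'S1', 'D3']
```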


Once a cycle is detected, a cycle removal algorithm is used to adjust the flows. The technique is illustrated in Figure 6, with the initial flow regime on the left and the flow after cycle removal on the right. The effect of a unit increase in flow in the clockwise direction around the cycle is determined by adding together the costs of the arcs whose flow increases and subtracting the costs of the arcs whose flow decreases. The change in objective function value per unit change in flow in one direction is the negative of the change in the opposite direction. In Figure 6, the supplies and demands are shown in the boxes and circles, while the numbers in parentheses above each arc give the cost and flow for that arc. For example, the cost $c_{34}$ is 5, and there are initially 3 units of flow on that arc.

Figure 6: An initial solution with a cycle (left), and the improved flow regime after cycle removal (right). The numbers in parentheses above each arc are $(c_{ij}, x_{ij})$.

A unit increase in flow clockwise around the cycle (beginning at $S_1$) changes the objective function value by $c_{13} - c_{23} + c_{22} - c_{32} + c_{34} - c_{14} = -4$, a net decrease. A unit increase in the counter-clockwise direction therefore yields $+4$, a net increase in the objective function. Clearly, increasing the clockwise flow is profitable, so flow is increased in this direction: the flows $x_{13}$, $x_{22}$, and $x_{34}$ increase, while $x_{23}$, $x_{32}$, and $x_{14}$ decrease. That is, flow is increased on the arcs in the cycle that point in the profitable direction and decreased on the other arcs, until one of the decreasing arcs reaches zero flow. At this point, the cycle has been removed and the value of the objective function has decreased. In Figure 6, increasing the flow clockwise around the cycle by four units breaks the cycle by eliminating the flow on arc $(1, 4)$, and reduces the value of the objective function by $4 \times 4 = 16$ units. The improved flow regime is shown on the right side of the figure.
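The flow adjustment can be sketched as a single function. This is a hypothetical helper, not the paper's routine. It assumes the cycle is given as its arcs $(i, j)$ in traversal order, alternating between arcs that gain flow and arcs that lose flow under a unit push in the starting direction. The costs and flows in the usage lines are made-up values, not those of Figure 6.

```python
def remove_cycle(cycle_arcs, cost, flow):
    """Push flow around one cycle of a transportation solution.

    Arcs at even positions of cycle_arcs gain flow under a unit push in the
    starting direction; arcs at odd positions lose it. cost and flow map
    arcs (i, j) to c_ij and x_ij; flow is updated in place. Returns
    (change in objective, units pushed).
    """
    up, down = cycle_arcs[0::2], cycle_arcs[1::2]
    delta = sum(cost[a] for a in up) - sum(cost[a] for a in down)
    if delta > 0:
        # the opposite direction is profitable; its per-unit change is -delta
        up, down, delta = down, up, -delta
    # push until a decreasing arc reaches zero flow; with delta == 0 this
    # still breaks the cycle, just without changing the objective
    theta = min(flow[a] for a in down)
    for a in up:
        flow[a] += theta
    for a in down:
        flow[a] -= theta
    return delta * theta, theta


# Hypothetical data on the same arcs as the cycle in the text.
arcs = [(1, 3), (2, 3), (2, 2), (3, 2), (3, 4), (1, 4)]
cost = {(1, 3): 2, (2, 3): 4, (2, 2): 1, (3, 2): 2, (3, 4): 3, (1, 4): 5}
flow = {(1, 3): 1, (2, 3): 3, (2, 2): 2, (3, 2): 1, (3, 4): 2, (1, 4): 2}
print(remove_cycle(arcs, cost, flow))  # (-5, 1): arc (3, 2) drops to zero
```

Each node on the cycle has exactly one incident arc that gains flow and one that loses it, so supplies and demands stay satisfied after the push.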

This technique is used as a local relaxation method by applying it to pairs of subproblems. Two 
subproblems which are adjacent in cost space are joined to form an M x 4 problem, which is 
inspected for cycles. If any are found, they are removed and the problem is searched again. 

Two different methods for applying this technique are investigated. The first, termed total relaxation, is to join adjacent pairs of $M \times 2$ problems and remove the cycles, then repeat the process by joining adjacent $M \times 4$ pairs and removing the cycles, then joining $M \times 8$ problems, and so on, until the global problem for the current level has been inspected and certified to be cycle-free. This approach is, however, extremely expensive. The second approach employs only a local relaxation and is therefore true to multigrid principles: only pairs of $M \times 2$ subproblems are checked for cycles. The gain in speed from using this second method is significant, while the decrease in accuracy is negligible (see Table 1 in the next section).

EXPERIMENTAL RESULTS AND CONCLUSIONS 

The algorithm employed in this work is an FMG algorithm, using demand-weighting as the restriction method for computing costs, interpolation by local optimization, and local relaxation by cycle removal. The results are displayed in Table 1. While the multilevel algorithm performed well on problems with only two or three supply nodes, the results for the five-supply-node problem are unsatisfactory. The table clearly indicates that relaxation by cycle removal is as effective when applied over a local area as when applied globally, and the computational effort required for local relaxation is an order of magnitude smaller.

Table 1

Problem size (M x N)   Relaxation method   Run time   % above optimality
2 x 1024               Total               1.210835   0.02 %
2 x 1024               Local               0.131437   0.02 %
3 x 1024               Total               1.15788    8.41 %
3 x 1024               Local               0.108765   8.41 %
5 x 1024               Total               1.18411    58.4 %
5 x 1024               Local               0.106392   59.7 %
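The "order of magnitude" claim can be checked directly from the run times in Table 1. The snippet below only divides the tabulated numbers; the dictionary layout is an illustrative choice.

```python
# Run times (total relaxation, local relaxation) copied from Table 1.
run_times = {
    "2 x 1024": (1.210835, 0.131437),
    "3 x 1024": (1.15788, 0.108765),
    "5 x 1024": (1.18411, 0.106392),
}

speedups = {size: total / local for size, (total, local) in run_times.items()}
for size, ratio in speedups.items():
    print(f"{size}: total/local run time = {ratio:.1f}")
# 2 x 1024: 9.2, 3 x 1024: 10.6, 5 x 1024: 11.1
```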


We note also that this algorithm is not yet competitive with state-of-the-art network flow optimization methods. No numerical data are available, as a careful comparison has not been made; however, some rough comparisons indicate that much remains to be done before a competitive algorithm could be obtained.

The most significant contribution of the current research is the removal of the requirement for a physical interpretation of the problem, and of the dependence on a relationship between distance and shipping costs. By mapping the problem into cost-space, a multilevel approach can be applied to a much broader class of problems. Of course, there is a limit to the number of supply nodes this approach can handle, because the dimension of cost-space grows with the number of supply nodes. However, for problems with few supply nodes, this approach can be helpful. We predict that further work will show that problems which have either a very small number of supply nodes or a geometrical interpretation can be solved to within an acceptable degree of optimality using a multilevel approach. However, problems which meet neither of these criteria probably cannot be solved with currently known multilevel methods.


Further Research 


An algorithm analogous to the full approximation scheme (FAS) should be developed. In the current work, we were unable to find an effective method of extracting a correction from the solution on $\Omega^{2h}$ and applying it to the approximation on $\Omega^h$ while still maintaining feasibility. Instead, we compute the solution on $\Omega^{2h}$ and use interpolation to replace the solution on $\Omega^h$. Since no direct analog of the PDE residual is known for an optimization problem, FAS is likely the method of choice; however, the difficulty mentioned above must first be overcome.

Another possibility for improving this algorithm is to begin the procedure by overlaying the 
cost-space with a regular M-dimensional grid. The first step of the restriction process would then 
be to map the demand nodes from their natural irregularly spaced positions in cost-space to the 
regular grid points. Later, the final interpolation step would be to transfer from the regular grid 
back to the original demand points. This approach overcomes a shortcoming in the current algorithm, which aggregates demand nodes that are closest in relative distance in cost-space, regardless of the absolute distance between them. With a regular grid, a demand node on $\Omega^{2h}$ would reflect only the demand at nodes a distance of $2h$ or less away from it. Another important potential advantage is that the work on each coarser level is reduced by a factor of $2^{-M}$, instead of by one half as in the current research. If the regular grid approach proves worthwhile, then it could be extended to a fast adaptive composite (FAC) grid approach. In a network optimization setting, this
might be done by overlaying a fine grid on those regions of cost-space where the density of demand 
nodes is high, and a coarser grid on the areas of low density. In this way, the flow to nodes which 
are most similar to their nearest neighbors in cost-space will receive the benefit of a finer grid 
spacing, while nodes which are naturally more distinct from their neighbors will only enter the 
problem on the coarser levels. 
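The proposed first restriction step, mapping demand nodes onto a regular grid in cost-space, might look like the sketch below. It is hypothetical and not implemented in the paper. It assumes each demand node is given by its $M$ cost-space coordinates (one cost per supply node), that it is snapped to the nearest grid point with halves rounded up, and that demands falling on the same grid point are simply added.

```python
import math
from collections import defaultdict


def restrict_to_regular_grid(points, demands, h):
    """Hypothetical restriction step: snap each demand node (a tuple of M
    cost-space coordinates) to the nearest point of a regular grid with
    spacing h and add up the demands that land on the same grid point."""
    agg = defaultdict(float)
    for point, demand in zip(points, demands):
        key = tuple(math.floor(c / h + 0.5) for c in point)  # nearest grid index
        agg[key] += demand
    return {tuple(k * h for k in key): total for key, total in agg.items()}


coarse = restrict_to_regular_grid([(1.0, 2.1), (1.2, 1.9), (4.0, 0.1)],
                                  [10, 5, 7], h=1.0)
print(coarse)
```

Applying the same function again with spacing $2h$ would give the next coarser level. On a full regular grid, each coarsening keeps roughly one grid point in $2^M$, which is the source of the $2^{-M}$ work reduction mentioned above.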
SUMMARY


The finite strip method is a semi-analytical finite element process which allows for a discrete 
analysis of certain types of physical problems by discretizing the domain of the problem into finite 
strips. This method decomposes a single large problem into m smaller independent subproblems 
when m harmonic functions are employed, thus yielding natural parallelism at a very high level. 
In this paper we address vectorization and parallelization strategies for the dynamic analysis of 
simply-supported Mindlin plate bending problems and show how to prevent potential conflicts in 
memory access during the assemblage process. The vector and parallel implementations of this 
method and the performance results of a test problem under scalar, vector, and vector-concurrent 
execution modes on the Alliant FX/80 are also presented. 


INTRODUCTION 


More and more parallel computers have been developed and made available to the engineering and scientific computing community in recent years. To take advantage of current and future advanced multiprocessors, however, a great deal of effort remains to be made in the search for efficient parallel implementations. In this paper we address both the coarse-grain and fine-grain parallelism offered by the finite strip method (FSM) for the dynamic analysis of Mindlin plate bending problems and present our vector and parallel implementations on multiprocessors with vector processing capabilities. FSM, first developed in the context of thin plate bending analysis, is a semi-analytical finite element process [6, 22]. This method allows for a discrete analysis of certain types of physical problems by discretizing their domains into finite strips, approximating the true solution with a continuous harmonic series in one direction and piecewise interpolation polynomials in the others. Because of the orthogonality properties of the harmonic functions in the stiffness and mass matrix formulation, FSM decomposes a problem, when applicable, into many smaller independent subproblems, which yields coarse-grain parallelism in an extremely easy and natural way.

*This work was supported by the U.S. Department of Energy under Grant No. DOE DE-FG02-85ER25001 while the authors were with the Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign.

Figure 1: The coordinate system and sign convention.

Although not as versatile as the finite element method, FSM has been applied to a wide range 
of plate, folded plate, shell, and bridge deck problems [4, 6, 7, 8, 10, 18] because of its efficiency 
and simplicity. The performance induced by the coarse-grain parallelism of this method in a 
multiprocessing environment has been shown in [9] for the static analysis of Mindlin plate problems 
and in [20] for groundwater modeling. In this paper, we report and compare the performance 
results of our implementation for the dynamic analysis of a simply-supported rectangular Mindlin 
plate using scalar, vector, and vector-concurrent execution modes on an Alliant FX/80. 

THE PROBLEM 


In this section we briefly describe the mathematical modeling of Mindlin plate problems [17]. Let $\Omega$ be the space domain in $\mathbb{R}^2$, $\Gamma$ its boundary, and $(0, T)$ the time domain. Let also the stress resultants, generalized strains, displacements, dynamic surface loadings, and inertia forces be denoted respectively by $s$, $r$, $d$, $p$, and $q$:

where $\rho$ stands for the mass density (per unit volume), $h$ the thickness of the plate, and $\ddot{v}$ ($v = w$, $\theta_x$, or $\theta_y$) the second derivative of $v$ with respect to time $t$: $\ddot{v} = \partial^2 v / \partial t^2$. The subscripts $x$, $y$, and $z$ above represent the directions in the Cartesian coordinate system. The sign convention for the displacements and external loadings is shown in Figure 1. Neglecting the damping effect of the plate, the differential equations that govern the state of stress resultants, generalized strains, and displacements in an elastic plate can be expressed as

1. Equilibrium equations: $L_1^T s + p + q = 0$ in $\Omega \times (0, T)$, subject to appropriate boundary conditions on $\Gamma$,

2. Stress-strain equations: $s = D r$, and

3. Strain-displacement equations: $r = L_2 d$.

Here $D$ is the material property matrix of an elastic plate, and $L_1$ and $L_2$ are the differential operators

$$L_1^T = \begin{bmatrix} 0 & 0 & 0 & \partial/\partial x & \partial/\partial y \\ \partial/\partial x & 0 & \partial/\partial y & -1 & 0 \\ 0 & \partial/\partial y & \partial/\partial x & 0 & -1 \end{bmatrix} \tag{1}$$

and

$$L_2 = \begin{bmatrix} 0 & 0 & 0 & \partial/\partial x & \partial/\partial y \\ -\partial/\partial x & 0 & -\partial/\partial y & -1 & 0 \\ 0 & -\partial/\partial y & -\partial/\partial x & 0 & -1 \end{bmatrix}^T \tag{2}$$

where the superscript $T$ denotes the transpose of a matrix.


For orthotropic material, the matrix $D$ takes the form

$$D = \begin{bmatrix} D_x & D_1 & 0 & 0 & 0 \\ D_1 & D_y & 0 & 0 & 0 \\ 0 & 0 & D_{xy} & 0 & 0 \\ 0 & 0 & 0 & \alpha G_x & 0 \\ 0 & 0 & 0 & 0 & \alpha G_y \end{bmatrix} \tag{3}$$

where $D_x, D_1, \ldots, G_y$ are the standard flexural and shear rigidities of plates, and $\alpha$ is a modification coefficient that accounts for the deviation of the shear strain distribution from uniformity [4] ($\alpha = 5/6$ for a rectangular cross section; see [21, p. 371]). All other entries of $D$ are zero. If the material is isotropic, then the nonzero entries take the following values:

$$D_x = D_y = \frac{Eh^3}{12(1 - \nu^2)}, \qquad D_1 = \nu D_x, \qquad D_{xy} = \frac{1 - \nu}{2} D_x,$$

and

$$G_x = G_y = \frac{Eh}{2(1 + \nu)},$$

where $E$, $h$, and $\nu$ represent the material modulus (Young's modulus), plate thickness, and Poisson's ratio, respectively. The total potential energy of the plate due to the dynamic surface loading $p$ [17, 16, 14] can be written as

$$\Pi = \int_0^T \left( \frac{1}{2} \int_\Omega (L_2 d)^T D (L_2 d)\, d\Omega - \int_\Omega p^T d\, d\Omega - \frac{1}{2} \int_\Omega \dot{d}^T A \dot{d}\, d\Omega \right) dt \tag{4}$$

where $\dot{d} = \partial d / \partial t$ and $A = \mathrm{diag}\left[\rho h, \tfrac{1}{12}\rho h^3, \tfrac{1}{12}\rho h^3\right]$, a diagonal matrix.
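To make the isotropic formulas concrete, the sketch below builds $D$ of (3) and the inertia matrix $A$ of (4) with NumPy. The function names are hypothetical, rows and columns of $D$ follow the order of the entries in (3), and the sample values ($E = 30 \times 10^6$ psi, $\nu = 0.25$, $h = 1$ in) are borrowed from the numerical experiment described later.

```python
import numpy as np


def mindlin_D_isotropic(E: float, nu: float, h: float,
                        alpha: float = 5.0 / 6.0) -> np.ndarray:
    """Material matrix D of (3) for an isotropic plate; rows and columns are
    ordered as in (3): the three bending terms, then the two shear terms."""
    Dx = E * h**3 / (12.0 * (1.0 - nu**2))  # flexural rigidity, Dx = Dy
    G = E * h / (2.0 * (1.0 + nu))          # shear rigidity, Gx = Gy
    D = np.zeros((5, 5))
    D[0, 0] = D[1, 1] = Dx
    D[0, 1] = D[1, 0] = nu * Dx              # D1
    D[2, 2] = 0.5 * (1.0 - nu) * Dx          # Dxy
    D[3, 3] = D[4, 4] = alpha * G            # alpha * Gx, alpha * Gy
    return D


def inertia_A(rho: float, h: float) -> np.ndarray:
    """A = diag[rho h, rho h^3 / 12, rho h^3 / 12] from (4)."""
    return np.diag([rho * h, rho * h**3 / 12.0, rho * h**3 / 12.0])


D = mindlin_D_isotropic(E=30e6, nu=0.25, h=1.0)
print(np.diag(D))  # Dx, Dy, Dxy, alpha*Gx, alpha*Gy
```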


A STRIP ELEMENT FOR MINDLIN PLATES 

We now outline the FSM formulation for Mindlin plates using linear elements [4, 19]. We confine our discussion to rectangular Mindlin plate problems simply supported on two opposite sides. Figure 2 shows a rectangular plate discretized into $n - 1$ finite strips. The plate is assumed to be simply supported on edges $y = 0$ and $y = L_y$. Shown in Figure 3 is the mid-plane of a typical linear strip plate element of constant thickness $h$, whose local coordinate system is denoted by $(x', y', z')$, where $x' = x - x_i$, $y' = y$, and $z' = z$. Let $\Omega^{(e)}$ be the domain of the $e$-th strip element, and let $i$ and $j$ be the two longitudinal edges (nodal lines) of the element, as shown in Figure 3. Let $d_{(e)}(x, y, t)$ and $u^l_{(e)}(t)$ be defined as

$$d_{(e)}(x, y, t) = \begin{bmatrix} w(x, y, t) & \theta_x(x, y, t) & \theta_y(x, y, t) \end{bmatrix}^T_{(e)}$$

and

$$u^l_{(e)}(t) = \begin{bmatrix} w_i^l(t) & \theta_{xi}^l(t) & \theta_{yi}^l(t) & w_j^l(t) & \theta_{xj}^l(t) & \theta_{yj}^l(t) \end{bmatrix}^T_{(e)},$$

where $w_i^l(t)$ denotes the $l$-th harmonic coefficient (amplitude) of $w_i(y, t)$, the displacement along edge $i$, and similarly for the other components. For a linear strip element with $m$ harmonic terms specified, the approximation to $d_{(e)}$ is given [4, 18] by

$$d_{(e)}(x, y, t) \approx \sum_{l=1}^{m} F^l(x, y)\, u^l_{(e)}(t) \tag{5}$$

with
where $S_l$ and $C_l$ are the $l$-th harmonic functions of $y$, and $N_i$ and $N_j$ are the linear shape functions of $x$, defined by
Figure 3: A typical plate strip element.

where $r^{(e)}$, ranging from $-1$ to $1$, is the natural coordinate in the $x$-direction of the $e$-th element. Note that $r^{(e)} = -1 + 2x'/(x_j - x_i)$ for the element shown in Figure 3. It should be observed that the approximation to the displacement vector in (5) satisfies the simply supported boundary conditions on edges $y = 0$ and $y = L_y$; i.e., $w$, $\theta_x$, $\partial w/\partial x$, $\partial \theta_x/\partial x$, and $\partial \theta_y/\partial y$ all vanish on these two edges. The dynamic surface loading on the $e$-th element, $p_{(e)}(x, y, t)$, can often be approximated by the sum of a harmonic series in the longitudinal direction:

$$p_{(e)}(x, y, t) \approx \sum_{l=1}^{m} H^l(y)\, p^l_{(e)}(x, t) \tag{6}$$

where $H^l = \mathrm{diag}[S_l, S_l, C_l]$ and $p^l_{(e)} = [q^l \;\; m_x^l \;\; m_y^l]^T_{(e)}$. The subscript $(e)$ outside the brackets indicates that every component of the vector is associated only with the $e$-th element.

Following the standard finite element procedure and taking advantage of the orthogonality properties of the harmonic functions, we obtain a linear algebraic differential system of block diagonal form [5]:

$$M\ddot{u} + Ku = f \tag{7}$$

where

$$M = M^{11} \oplus M^{22} \oplus \cdots \oplus M^{mm} \quad \text{and} \quad K = K^{11} \oplus K^{22} \oplus \cdots \oplus K^{mm}$$

are block diagonal matrices of the same block structure. The vectors $u$ and $f$ are partitioned accordingly:

$$u^T = \left[(u^1)^T \; (u^2)^T \; \cdots \; (u^m)^T\right] \quad \text{and} \quad f^T = \left[(f^1)^T \; (f^2)^T \; \cdots \; (f^m)^T\right].$$

In (7), the symbol $\oplus$ stands for the direct sum of square matrices. $M^{ll}$, $K^{ll}$, $u^l$, and $f^l$ are the system mass matrix, system stiffness matrix, system displacement amplitude vector, and system load amplitude vector due to the $l$-th harmonic mode, respectively. In the rest of the paper, we drop the term amplitude for brevity and simply call $u^l$ ($f^l$) the $l$-th system displacement (load) vector. $M^{ll}$ is assembled from the strip mass matrices $M^l_{(e)}$, $K^{ll}$ from the strip stiffness matrices $K^l_{(e)}$, and $f^l$ from the strip load vectors $f^l_{(e)}$, where

$$M^l_{(e)} = \int_{\Omega^{(e)}} (F^l)^T A F^l \, d\Omega^{(e)}, \quad l = 1, \ldots, m, \tag{8}$$

$$K^l_{(e)} = \int_{\Omega^{(e)}} (L_2 F^l)^T D (L_2 F^l) \, d\Omega^{(e)}, \quad l = 1, \ldots, m, \tag{9}$$

$$f^l_{(e)} = \int_{\Omega^{(e)}} (F^l)^T H^l p^l_{(e)} \, d\Omega^{(e)}, \quad l = 1, \ldots, m. \tag{10}$$

For a plate discretized with $n$ nodal lines, $K^{ll}$ and $M^{ll}$ are square matrices of order $3n$ for each $l$ ($K^l_{(e)}$ and $M^l_{(e)}$ are of order 6). Once the entire system stiffness matrix $K$, system mass matrix $M$, and system load vector $f$ are assembled and the boundary conditions imposed, the remaining major work is to solve the linear algebraic differential system (7) for $u$, $\dot{u}$, and $\ddot{u}$.

PARALLEL AND VECTOR IMPLEMENTATIONS 

Computational Procedure. As in the finite element method, FSM normally consists of three main computational components: (1) the generation of strip stiffness/mass matrices and strip load vectors for all strip elements, (2) the assemblage of the entire system stiffness/mass matrix and system load vector, and (3) the solution of the resulting linear differential system $M\ddot{u} + Ku = f$. There are many step-by-step integration methods available for solving second-order linear differential equations, among them the central difference, Houbolt, Wilson $\theta$, and Newmark $\beta$ methods. The central difference method is an explicit scheme and the other three are implicit. Regardless of whether the method employed is implicit or explicit, the procedure basically involves an initial calculation of an effective coefficient matrix, followed at each time step by the formation of an effective load vector and the solution of an effective linear system. In this paper, we employ the Newmark integration method, whose procedure is shown below, where $a_0, a_1, \ldots, a_7$ are the Newmark integration constants [3, p. 311]:

(1) initial calculation of the effective stiffness matrix $\hat{K} = K + a_0 M$, the factorization of $\hat{K}$ into $LL^T$ or $LDL^T$ form, and then, for each time step $t_{k+1}$, $k = 0, 1, \ldots$,

(2) forming the effective load vector $\hat{f}$ at time $t_{k+1}$: $\hat{f}_{k+1} = f_{k+1} + M(a_0 u_k + a_2 \dot{u}_k + a_3 \ddot{u}_k)$,

(3) solving the effective linear system at time $t_{k+1}$: $\hat{K} u_{k+1} = \hat{f}_{k+1}$,

(4) calculating the acceleration and velocity vectors $\ddot{u}_{k+1}$ and $\dot{u}_{k+1}$:

$$\ddot{u}_{k+1} = a_0(u_{k+1} - u_k) - a_2 \dot{u}_k - a_3 \ddot{u}_k, \qquad \dot{u}_{k+1} = \dot{u}_k + a_6 \ddot{u}_k + a_7 \ddot{u}_{k+1}.$$

Note that the first step need be performed only once. The last three steps, however, must be performed at every time step and therefore constitute the most time-consuming part of the entire analysis.
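The four steps can be sketched for one decoupled subsystem $M^{ll}\ddot{u}^l + K^{ll}u^l = f^l$ as follows. This is a minimal dense NumPy sketch, not the authors' banded implementation. It assumes the standard Newmark constants as tabulated in Bathe's finite element texts (which we take to match those in [3]), with the paper's parameter names $\alpha$ and $\delta$. It also assumes the initial acceleration is obtained from the equation of motion at $t = 0$. The names `newmark_constants` and `newmark` are hypothetical.

```python
import numpy as np


def newmark_constants(dt: float, alpha: float = 0.25, delta: float = 0.5):
    """Newmark integration constants a0..a7 (Bathe-style tabulation)."""
    a0 = 1.0 / (alpha * dt**2)
    a1 = delta / (alpha * dt)
    a2 = 1.0 / (alpha * dt)
    a3 = 1.0 / (2.0 * alpha) - 1.0
    a4 = delta / alpha - 1.0
    a5 = 0.5 * dt * (delta / alpha - 2.0)
    a6 = dt * (1.0 - delta)
    a7 = delta * dt
    return a0, a1, a2, a3, a4, a5, a6, a7


def newmark(M, K, load, u0, v0, dt, nsteps, alpha=0.25, delta=0.5):
    """Undamped Newmark integration of M u'' + K u = f(t) for one subsystem.

    load(t) returns f at time t. a1, a4, and a5 only enter damped problems
    and are unused here.
    """
    a0, _, a2, a3, _, _, a6, a7 = newmark_constants(dt, alpha, delta)
    # Step (1), done once: K_hat = K + a0 M, factorized as L L^T.
    L = np.linalg.cholesky(K + a0 * M)
    u = np.asarray(u0, dtype=float)
    v = np.asarray(v0, dtype=float)
    acc = np.linalg.solve(M, load(0.0) - K @ u)  # assumed start from the ODE
    for k in range(nsteps):
        t_next = (k + 1) * dt
        f_hat = load(t_next) + M @ (a0 * u + a2 * v + a3 * acc)   # step (2)
        u_next = np.linalg.solve(L.T, np.linalg.solve(L, f_hat))  # step (3)
        acc_next = a0 * (u_next - u) - a2 * v - a3 * acc          # step (4)
        v = v + a6 * acc + a7 * acc_next
        u, acc = u_next, acc_next
    return u, v, acc
```

As a sanity check, a single undamped oscillator with $M = K = 1$, $u_0 = 1$, and $\dot{u}_0 = 0$ should track $\cos t$ closely. For the experiment's $\Delta t = 10^{-5}$ s, $\alpha = 0.25$, and $\delta = 0.5$, this tabulation gives $a_0 = 4 \times 10^{10}$ and $a_6 = a_7 = 5 \times 10^{-6}$.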
To address the parallel implementation of FSM, we first exploit the decoupled structure of the system matrices in (7), which is due to the orthogonality properties of the harmonic functions. This decoupling leads to $m$ independent sets of differential equations. Therefore, solving (7) is equivalent to solving

$$M^{ll}\ddot{u}^l + K^{ll}u^l = f^l, \quad l = 1, \ldots, m,$$

where $K^{ll}$ and $M^{ll}$, $l = 1, \ldots, m$, are block tridiagonal matrices with blocks of order only $3 \times 3$ for the ordering shown in Figure 2. Furthermore, each $M^{ll}$ consists of only three nonzero diagonals. Since there is no data dependency among these $m$ subsystems, not only can the generation of $M^l_{(e)}$, $K^l_{(e)}$, and $f^l_{(e)}$ and the assemblage of $M^{ll}$, $K^{ll}$, and $f^l$ for each harmonic term be performed independently, but all the subsystems can also be solved in parallel. In a parallel computing environment with two levels of parallelism (considering vectorization as the first level), this special feature makes FSM fully parallelizable when the number of harmonic terms matches the number of processors. The following pseudo-Fortran code outlines its computational procedure and indicates where parallelism can be exploited for vector/concurrent execution.


C     Initial calculations
      DO 200 l = 1, m                                  (concurrent, one CPU per iteration)
         DO 100 e = 1, Ns                              (to be discussed)
            Generate K^l_(e), M^l_(e), and f^l_(e)
            Assemble K^ll, M^ll, and f^l
         END 100
         Initialize u^l, u̇^l, and ü^l                  (vector)
         Form K̂^ll from K^ll and M^ll                  (vector)
         Factorize K̂^ll into LL^T or LDL^T form        (vector)
      END 200

C     Calculations for each time step
      DO until the last time step                      (sequential)
         DO 400 l = 1, m                               (concurrent, one CPU per iteration)
            DO 300 e = 1, Ns                           (to be discussed)
               Generate and assemble f^l
            END 300
            Form effective load vector f̂^l             (vector)
            Solve K̂^ll u^l = f̂^l for u^l              (vector)
            Calculate ü^l and u̇^l                     (vector)
         END 400
         DO 600 l = 1, m                               (sequential)
            Accumulate displacements w for all strips  (vector-concurrent)
         END 600
      END DO


In the above pseudo-code, we omit the step of imposing boundary conditions because it can be performed in the generation step. The word concurrent inside the parentheses after the DO statements indicates that all iterations of the loop may be performed in parallel, one processor per iteration; the word vector (or vector-concurrent) indicates that the computations involved in the statement should be performed in vector (or vector-concurrent) mode whenever possible and desirable. Whether a vector operation is desirable depends on its startup overhead and vector length.


Data Structure and Parallelization. To allow current code restructurers to automatically vectorize or parallelize certain computations, the Fortran statements related to those computations are usually written in the form of DO loops or array constructs. Potential memory access conflicts must also be resolved. Therefore, the data structure of the code plays an essential role. In our implementations, the system stiffness matrix $K$ and system mass matrix $M$ are represented by two 3D arrays SK(1:nbk,1:n,1:m) and SM(1:nbm,1:n,1:m), respectively, where nbk (nbm) is the semi-bandwidth of $K$ ($M$), n the number of equations in each harmonic term, and m the number of harmonic terms. It should be noted that in many situations it is more beneficial to interchange the first two dimensions of both SK and SM, or to concatenate the first two dimensions into a single dimension. The system load vector $f$ is represented by a 2D array SF(1:n,1:m), and the vectors $u$, $\dot{u}$, and $\ddot{u}$ are similarly represented by 2D arrays SU, SV, and SA, respectively. This representation allows parallelization across harmonic terms to be performed in the outermost loop. It also makes passing references to subroutines an easy task.

As an example, consider the DO 200 loop, where the computations inside the loop are now translated into subroutine calls as shown below (the DO 400 loop follows the same approach).

CVD$L CNCALL                                       ! an Alliant directive
      DO 200 L = 1, m                              ! concurrent, one CPU per iteration
         CALL GenAss (SK(1,1,L), SM(1,1,L), SF(1,L), L, n, nbk, nbm, ns, ...)
         CALL Initialize (SU(1,L), SV(1,L), SA(1,L), ...)   ! Initialize u_0, u̇_0, and ü_0
         CALL Form (SK(1,1,L), SM(1,1,L), n, nbk, nbm, a0)  ! Form K̂^ll and overwrite SK
         CALL Factorize (SK(1,1,L), n, nbk)                 ! Factorize K̂^ll and overwrite SK
  200 CONTINUE

where GenAss is a subroutine performing the task of the DO 100 loop in the previous pseudo-code. The other three subroutines are self-explanatory. In the above code, the argument ns denotes the number of strips $N_s$, and a0 is the Newmark constant $a_0$. With this approach, each processor has an identical local copy of the subroutines inside the loop, generated automatically by the compiler, and its own reference space (via the index L) for locating $K^{ll}$, $M^{ll}$, and $u^l$, yielding concurrent execution for all harmonic terms because distinct processors hold different values of L. This not only prevents memory access conflicts in performing these tasks but also enables us to use a single set of subroutines for all harmonic terms. Note that the index L is also passed to the subroutine GenAss because it is required for evaluating $K^l_{(e)}$, $M^l_{(e)}$, and $f^l_{(e)}$, whose arrays should be declared inside GenAss and thus become local variables.
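The per-harmonic data layout can be mimicked in Python to show why no locking is needed. This is a sketch under stated assumptions, not a model of the Alliant runtime. Python threads stand in for processors, each $K^{ll}$ is stored densely in SK[:, :, L] rather than in band form, and `solve_harmonics` is a hypothetical name. Each task reads and writes only its own slice L, just as each processor works only on SK(1,1,L).

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def solve_harmonics(SK: np.ndarray, SF: np.ndarray) -> np.ndarray:
    """Solve the m decoupled systems K^ll u^l = f^l concurrently.

    SK has shape (n, n, m) with K^ll stored densely in SK[:, :, L] (the
    paper stores band matrices in SK(1:nbk,1:n,1:m)); SF has shape (n, m).
    Task L touches only slice L, so no two tasks write the same memory.
    """
    n, _, m = SK.shape
    SU = np.zeros((n, m))

    def task(L: int) -> None:
        SU[:, L] = np.linalg.solve(SK[:, :, L], SF[:, L])

    with ThreadPoolExecutor() as pool:
        list(pool.map(task, range(m)))  # list() re-raises any worker exception
    return SU
```

Solving the full block diagonal system (7) in one piece gives the same result as stacking the columns of SU, which is exactly the decoupling argument above.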
Vectorization. To address vectorization, we now turn to the computations for a single harmonic term. First we note that the formation of the effective stiffness matrix $\hat{K}^{ll}$ and effective load vector $\hat{f}^l$, and the calculation of $\dot{u}^l$ and $\ddot{u}^l$, consist mainly of matrix-matrix (vector-vector) additions and matrix-vector multiplications, and are thus highly vectorizable. The vectorization and parallelization of factorizing and solving the linear system $\hat{K}^{ll}u^l = \hat{f}^l$ have been studied intensively; see [13, 15, 23], for example. In this paper, we focus only on approaches to vectorizing the generation of $K^l_{(e)}$ and the assemblage of $K^{ll}$. The generation of $M^l_{(e)}$ ($f^l_{(e)}$) and the assemblage of $M^{ll}$ ($f^l$) follow the same approach and thus need not be discussed.

There are two approaches to vectorizing the generation of $K^l_{(e)}$. The first, referred to as Vectorization within a Single Strip (VSS), is to generate the entries of $K^l_{(e)}$ in vector mode. This approach requires minimal storage because $K^l_{(e)}$ for all strips can share the storage of a single strip stiffness matrix, which is usually the case for most traditional finite strip or finite element programs. The disadvantage is that the vector length available for vectorization is limited by the order of the strip stiffness matrix, 6 in our case, which is rather small. In addition, the generation step may not even involve any loop structure, because most of the Fortran statements may simply be assignment statements when the entries of $K^l_{(e)}$ are explicitly integrated. Therefore, we resort to the second approach, Vectorization across Multiple Strips (VMS). This approach generates the matrix entries component-wise across many different strips, exploiting the fact that each strip matrix can be generated independently of the others. It requires, however, a manual change in the data structure of the strip matrix in the computer program, because current code restructurers can hardly accomplish this task automatically. One way of achieving this is to add one more dimension (preferably the first dimension) to the array that stores a strip matrix, so that the new array can store all strip stiffness matrices. For example, let EKL(1:6,1:6) be the array used in the VSS approach for storing a single strip stiffness matrix, shared by all strips one at a time. (For simplicity, we ignore the symmetry of the matrix.) When the VMS approach is employed, we can simply change EKL to a 3D array, say EKL(1:ns,1:6,1:6), so that the first dimension is associated with strip identifications, allowing vector execution to be performed across strips. Although the change in data structure may impose some programming difficulty in modifying an existing code, this approach provides a very good way for both vectorization and parallelization.
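Since the mass matrices are said to follow the same pattern as the stiffness matrices, the VMS layout can be illustrated with the consistent strip mass matrices of (8), whose entries are simple enough to integrate by hand. This is a sketch under assumptions not spelled out in the text above, not the paper's routine. It assumes $F^l = [N_i H^l \;\; N_j H^l]$ with linear $N_i, N_j$, $H^l = \mathrm{diag}[S_l, S_l, C_l]$, $S_l = \sin(l\pi y/L_y)$, $C_l = \cos(l\pi y/L_y)$, and the degree-of-freedom order of $u^l_{(e)}$. The function name `strip_mass_vms` is hypothetical. The strip index comes first, as in EKL(1:ns,1:6,1:6), so every entry is produced for all strips in one vector operation.

```python
import numpy as np


def strip_mass_vms(b, Ly: float, rho: float, h: float) -> np.ndarray:
    """Consistent strip mass matrices for all strips at once (VMS layout).

    b holds the strip widths, shape (ns,); the result has shape (ns, 6, 6)
    with the strip index first. Under the assumed F^l, the x-integral of
    N_a N_b over a strip of width b is b/6 * [[2, 1], [1, 2]], and the
    y-integral of S_l^2 (or C_l^2) over [0, Ly] is Ly/2 for every l >= 1.
    """
    A = np.diag([rho * h, rho * h**3 / 12.0, rho * h**3 / 12.0])
    pattern = np.kron(np.array([[2.0, 1.0], [1.0, 2.0]]), A)  # (6, 6)
    b = np.asarray(b, dtype=float)
    # one broadcast multiply generates every entry for every strip
    return (b * Ly / 12.0)[:, None, None] * pattern[None, :, :]


EML = strip_mass_vms([2.0, 4.0], Ly=3.0, rho=1.0, h=1.0)
print(EML.shape)  # (2, 6, 6)
```

Each strip matrix couples a degree of freedom only with the same degree of freedom on the neighbouring nodal line (offsets 0 and 3). This is consistent with the statement that each $M^{ll}$ has only three nonzero diagonals.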


As far as the assemblage of the $l$-th system stiffness matrix $K^{ll}$ is concerned, both VSS and VMS are still applicable if potential data dependencies are avoided. Note that assembling an entry of $K^l_{(e)}$ into $K^{ll}$ has no conflict with assembling the other entries of the same matrix into $K^{ll}$. Vectorization can therefore be performed within any single strip matrix without difficulty, subject to the same disadvantage of short vector length as in the generation step. The following Fortran code indicates where vectorization can be performed using VSS for assembling the stiffness matrix, where the rows of SKL store the upper diagonals of the symmetric band matrix $K^{ll}$ in the Linpack format [12], with the main diagonal of $K^{ll}$ stored in the last row of SKL.

      DO 100 I = 1, NBK                  ! NBK (=6): semi-bandwidth of K^ll
         SKL(I, 1:N) = 0.0               ! (vector) Initialization. N: no. of equations of K^ll
  100 CONTINUE
      DO 300 K = 1, NS                   ! NS: no. of strips
         K1 = 3 * (K-1)
         DO 200 J = 1, 6
            J1 = K1 + J
            I1 = NBK - J + 1
            SKL(I1:NBK, J1) = SKL(I1:NBK, J1) + EKL(1:J, J)   ! (vector) Vector length too short.
  200    CONTINUE
  300 CONTINUE

Care, however, must be taken when the VMS approach is employed for assembling $K^{ll}$. This is because different strips may have some nodes in common, which amounts to saying that entries of $K^l_{(e)}$ from different strips may contribute to the same location in $K^{ll}$. Therefore, in order to vectorize the assemblage of $K^{ll}$ from $K^l_{(e)}$ across multiple strip elements, we must find a way to avoid potential simultaneous updates of a common matrix entry. A general approach to avoiding this situation is to use graph coloring techniques to partition the strips so that no two strips in the same group share a node. For the plate problems under consideration, two colors are enough: one for odd strips and the other for even strips. When a natural ordering is imposed as shown in Figure 2, however, a better approach to enhancing vectorization can be employed by assembling entries component-wise (or node-wise) across all strip elements, as shown below, assuming the $i$-th strip runs from nodal line $i$ to nodal line $i + 1$ and all strip stiffness matrices are available.

      DO 100 I = 1, NBK                  ! NBK (=6): semi-bandwidth of K^ll
         SKL(I, 1:N) = 0.0               ! (vector) N: no. of equations of K^ll
  100 CONTINUE
      DO 300 J = 1, 6
         JS = 3 * (NS-1) + J             ! NS: no. of strips
         DO 200 I = 1, J
            IJ = NBK - J + I
            SKL(IJ, J:JS:3) = SKL(IJ, J:JS:3) + EKL(1:NS, I, J)   ! (vector)
  200    CONTINUE
  300 CONTINUE


Note that the array EKL now has one more dimension than in the previous code. The storage can be reduced by about half if the symmetry of the matrix is taken into account. Finally, we note that for a cluster-based multiprocessor with three levels of parallelism, such as the Cedar [11], FSM is well suited, because the decoupling at the system level offers a great deal of freedom for the problem to be solved using all levels of parallelism. For example, we need to exploit only the first two levels of parallelism in a linear system solver instead of three, because the highest level of parallelism can be employed across multiple linear subsystems.

Figure 4: The triangular loading (uniformly distributed on the entire plate).
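The node-wise loop and the two-coloring argument described earlier can be checked in the same kind of hypothetical setting. The following sketch (0-based indices, placeholder strip matrices, 3 degrees of freedom per nodal line assumed) performs the strided update for each local entry (i, j) across all strips at once, asserts that the columns touched by one such update are distinct, and confirms that strips of equal parity never share a nodal line.

```python
# Node-wise (strided) assembly in the order of the second Fortran loop, plus the
# odd/even two-coloring. Hypothetical setup: 3 DOF per nodal line, strip k
# (0-based) spans nodal lines k and k + 1, placeholder strip matrices.


def strip_matrices(ns):
    """Placeholder symmetric 6x6 strip stiffness matrices (not Mindlin strips)."""
    return [[[float(10 * (k + 1) + a + b) for b in range(6)] for a in range(6)]
            for k in range(ns)]


def assemble_dense(ekl, ns):
    n = 3 * (ns + 1)
    K = [[0.0] * n for _ in range(n)]
    for k in range(ns):
        for a in range(6):
            for b in range(6):
                K[3 * k + a][3 * k + b] += ekl[k][a][b]
    return K


def assemble_nodewise(ekl, ns, nbk=6):
    """Mirror SKL(IJ, J:JS:3) = SKL(IJ, J:JS:3) + EKL(1:NS, I, J): for each local
    entry (i, j), update column j + 3k of every strip k in one long vector."""
    n = 3 * (ns + 1)
    skl = [[0.0] * n for _ in range(nbk)]
    for j in range(6):
        cols = list(range(j, 3 * (ns - 1) + j + 1, 3))  # one column per strip
        assert len(set(cols)) == ns                      # no two updates hit one entry
        for i in range(j + 1):
            ij = nbk - 1 - j + i                          # band row of local entry (i, j)
            for k, col in enumerate(cols):                # vector length ns
                skl[ij][col] += ekl[k][i][j]
    return skl


def two_coloring(ns):
    """Even- and odd-numbered strips (0-based) as the two color classes."""
    return [list(range(0, ns, 2)), list(range(1, ns, 2))]


def shares_no_nodal_line(group):
    lines = [line for k in group for line in (k, k + 1)]
    return len(lines) == len(set(lines))


ns = 5
ekl = strip_matrices(ns)
skl = assemble_nodewise(ekl, ns)
K = assemble_dense(ekl, ns)
ok = all(skl[5 + i - j][j] == K[i][j]
         for j in range(len(K)) for i in range(max(0, j - 5), j + 1))
print(ok)                                                   # True
print([shares_no_nodal_line(g) for g in two_coloring(ns)])  # [True, True]
```

The vector length here is the number of strips rather than at most six, which is the point of the node-wise ordering.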


NUMERICAL EXPERIMENTS 

To demonstrate the effectiveness and parallelizability of FSM, we consider the dynamic Mindlin analysis of a thin steel plate that is simply supported on all four edges and is subject to a uniformly distributed triangular loading $q(t)$, as shown in Figure 4. This plate, adapted from [2], is 60 inches ($L_x$) wide, 40 inches ($L_y$) long, and one inch thick throughout. The material of the plate is assumed to be isotropic with Young's modulus $E = 30 \times 10^6$ psi, Poisson ratio $\nu = 0.25$, and mass density $m = 0.00073$ lb-sec²/in⁴. The time step size $\Delta t$ is set to 0.00001 sec. In evaluating the strip stiffness matrices, reduced integration with one Gauss point is used to avoid shear locking [18]. The strip mass matrices are evaluated using the consistent mass approach. The resulting system of linear ordinary differential equations is integrated by the Newmark method with parameters $\alpha = 0.25$ and $\delta = 0.50$ [3, p. 311]. A banded direct solver is used to solve the resulting linear subsystems at each time step.
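For reference, one Newmark step with α = 0.25 and δ = 0.50 (the constant-average-acceleration rule) can be written in effective-stiffness form. The sketch below is a minimal illustration rather than the solver used in the paper: it treats a single undamped degree of freedom. In the plate problem the scalars `m` and `k` would be replaced by the mass and stiffness matrices for each harmonic, and the effective stiffness would presumably be factored once by the banded direct solver and reused at every time step (an assumption about the implementation).

```python
import math


def newmark_sdof(m, k, u0, v0, dt, nsteps, alpha=0.25, delta=0.5, load=lambda t: 0.0):
    """Newmark integration of m*u'' + k*u = load(t) for one undamped DOF,
    written in effective-stiffness form with parameters alpha and delta."""
    a0 = 1.0 / (alpha * dt * dt)
    a2 = 1.0 / (alpha * dt)
    a3 = 1.0 / (2.0 * alpha) - 1.0
    a6 = dt * (1.0 - delta)
    a7 = delta * dt
    u, v = u0, v0
    acc = (load(0.0) - k * u) / m          # initial acceleration from the equation of motion
    k_eff = k + a0 * m                     # in the matrix case: factor once, reuse every step
    for step in range(1, nsteps + 1):
        t = step * dt
        r_eff = load(t) + m * (a0 * u + a2 * v + a3 * acc)
        u_new = r_eff / k_eff              # the linear solve performed at each time step
        acc_new = a0 * (u_new - u) - a2 * v - a3 * acc
        v_new = v + a6 * acc + a7 * acc_new
        u, v, acc = u_new, v_new, acc_new
    return u, v, acc


u, v, _ = newmark_sdof(m=1.0, k=1.0, u0=1.0, v0=0.0, dt=0.01, nsteps=100)
print(abs(u - math.cos(1.0)) < 1e-4)                  # True: close to the exact cos(t)
print(abs(0.5 * v * v + 0.5 * u * u - 0.5) < 1e-10)   # True: energy is conserved
```

With these parameters the scheme is unconditionally stable and conserves the energy of an undamped linear system, which is one reason this choice is common in structural dynamics.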

In Figure 5, we compare the numerical solution for the displacement $w$ at the center of the plate, computed with 16 Mindlin strip elements, with the exact solution (Fourier series) derived from Kirchhoff thin plate theory. Eight harmonic terms are used in the finite strip approximation. Figure 5 shows that the finite strip solution agrees well with the exact solution of the Kirchhoff theory. The performance of this method on an Alliant FX/80 is shown in Tables 2 and 3. In Table 2, we compare the CPU time (in seconds) consumed in the entire analysis, including the generation, assembly, and solution of the equations of motion and, finally, the calculation of the displacements. Three execution modes are considered: scalar (S), vector (V), and vector-concurrent (VC). The compiler options [1] used for these modes are shown in Table 1.

Table 2 shows the vector speedup (the ratio of the one-processor CPU time under the S mode to that under the V mode) for the entire process. As seen from this table, the vector speedups for the three most time-consuming parts, (1) solving $Ku = f$, (2) computing $f$, $\dot u$, and $\ddot u$, and (3) generating $f^{(e)}$ and assembling $f$, are 1.29, 3.60, and 3.45, respectively. Note that the semi-bandwidth of the system stiffness matrix is only 6 in this example, which is too small for a banded direct solver to take advantage of vector instructions. The vector speedups for the other two parts, however, are very satisfactory. In generating and assembling $f$, we employed the VMS approach, which yields much better vector performance than the VSS approach. Table 3 shows the concurrency speedup $S_k$, defined as the ratio of the CPU time of the entire program under the VC execution mode using one processor to that using $k$ processors, and the efficiency $E_k = S_k/k$. Figure 6 plots the speedup against the number of processors used. As seen from Table 3, the concurrency speedups observed using 2, 4, and 8 processors are 1.97, 3.68, and 6.61, respectively, corresponding to efficiencies of 98.5%, 92.0%, and 82.6%. These results indicate that FSM parallelizes well on multiprocessors when the number of harmonic terms used matches the number of processors available.

Table 1: Compiler options.

Execution mode            Compiler options    Subprograms compiled
Scalar (S)                -Og -AS -pg         the entire program
Vector (V)                -Ogv -AS -pg        the entire program
Vector-concurrent (VC)    -Ogv -AS            recursively-called subroutines
                          -Ogvc -AS           others

Table 2: CPU time (in seconds) and vector speedup on the Alliant FX/80 using one processor.

Step                              Scalar (S)   Vector (V)   S/V    Remark
Solve LDL^T u = f                 177.1        137.1        1.29   semi-bandwidth too small
Compute f, u̇, ü (Newmark)         91.0         25.3         3.60   mainly DAXPY operations
Generate f^(e) and assemble f     42.7         12.4         3.45   using the VMS approach
Initialization and I/O            1.72         1.70         1.01   no manual optimization
Total                             312.4        176.4        1.77

Table 3: Parallel performance under the vector-concurrent mode.

No. of processors k         1        2        4        8
CPU time (seconds)          165.7    84.14    45.01    25.08
Concurrency speedup S_k     1.00     1.97     3.68     6.61
Efficiency E_k (%)          100.0    98.5     92.0     82.6

Figure 6: Concurrency speedup on the Alliant FX/80.
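The definitions $S_k = T_1/T_k$ and $E_k = S_k/k$ can be applied directly to the CPU times in Table 3. This short sketch recomputes both rows from the measured times; it reproduces the tabulated speedups and efficiencies to the printed precision.

```python
def concurrency_metrics(times):
    """Return {k: (S_k, E_k in percent)} with S_k = T_1 / T_k and E_k = S_k / k."""
    t1 = times[1]
    return {k: (t1 / tk, 100.0 * t1 / (tk * k)) for k, tk in sorted(times.items())}


table3 = {1: 165.7, 2: 84.14, 4: 45.01, 8: 25.08}   # CPU seconds from Table 3
for k, (s_k, e_k) in concurrency_metrics(table3).items():
    print(f"k = {k}: S_k = {s_k:.2f}, E_k = {e_k:.1f}%")
```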

CONCLUSIONS 

The effectiveness and parallelizability of the finite strip method (FSM) for the dynamic analysis of a class of Mindlin plates have been addressed, and vector/parallel implementations have been presented. The performance of this method on the Alliant FX/80 has also been tested using a rectangular plate that is simply supported on all edges and subject to a uniformly distributed triangular loading. In these experiments, we obtained concurrency speedups of 1.97, 3.68, and 6.61 using 2, 4, and 8 processors, respectively. These speedups are very encouraging and demonstrate the suitability of FSM for a parallel processing environment. For vectorization, good performance has also been observed for the Newmark integration scheme and for the generation/assembly process using the VMS (vectorization across multiple strips) approach. In summary, although vector performance during the solution stage may be hindered by the small semi-bandwidth of the subsystems if a direct solver is employed, FSM is highly parallelizable and therefore suitable for computation on multiprocessor or multicluster computers. This is especially true when the problem requires a large number of harmonic terms to yield accurate results.
DOMAIN DECOMPOSITION METHODS FOR NONCONFORMING FINITE ELEMENT SPACES OF LAGRANGE-TYPE*
Houston, Texas

SUMMARY 

In this article, we consider the application of three popular domain decomposition methods to Lagrange-type nonconforming finite element discretizations of scalar, self-adjoint, second order elliptic equations. The additive Schwarz method of Dryja and Widlund, the vertex space method of Smith, and the balancing method of Mandel applied to nonconforming elements are shown to converge at a rate no worse than their applications to the standard conforming piecewise linear Galerkin discretization. Essentially, the theory for the nonconforming elements is inherited from the existing theory for the conforming elements, with only modest modifications, by constructing an isomorphism between the nonconforming finite element space and a space of continuous piecewise linear functions.


INTRODUCTION 

We consider the convergence properties of domain decomposition methods applied to Lagrange-type nonconforming finite element discretizations of scalar, self-adjoint, second order elliptic problems. An isomorphism is constructed between the nonconforming finite element space, with the natural norm induced by the elliptic problem, and a conforming piecewise linear space with the $H^1$-seminorm. Using the isomorphism, we are able to apply the existing analysis of domain decomposition methods for conforming elements to nonconforming elements with only modest modifications. As examples of this technique, we show that the operators arising in three popular domain decomposition methods, specifically, the additive Schwarz method of Dryja and Widlund [1], the vertex space method of Smith [2], and the balancing method of Mandel [3], applied to nonconforming finite elements have condition numbers that satisfy the same bounds as those given in [4] and [5] for conforming finite elements.

The same technique was used in [6] and [7] to analyze the rate of convergence of balancing domain decomposition and the standard additive Schwarz method for the dual-variable mixed finite element formulation. Moreover, as a corollary of the analysis of Smith's method for the nonconforming spaces presented in this paper, we obtain a new bound for Smith's method applied to mixed finite elements.

*This work was supported in part by the National Science Foundation under Grant No. DMS-9112847.
After the research for this paper was completed, the author was made aware of some related work done concurrently by Sarkis [8]. In particular, the isomorphism used herein was independently suggested by Sarkis for linear nonconforming elements. In [8], Sarkis constructs and analyzes special coarse spaces such that, when the overlapping additive Schwarz method is applied, the condition number of the resulting operator is bounded by a constant times $(1 + \log(H/h))(1 + H/\delta)$ in both two and three dimensions. Here $H$ and $h$ are the characteristic sizes of the subdomains and the mesh, respectively, and $\delta$ is a measure of the overlap of subdomains. The notable characteristic of Sarkis' bound is that the constant is independent of jumps in the coefficients across subdomain boundaries. If the techniques of this paper were used to derive bounds that are independent of the jumps in coefficients, the resulting bound would include one log factor in two dimensions using [1, 9], but two log factors in three dimensions using [5, 10, 11].

The remainder of this paper is divided into six sections. In the next section, we set some notation, formulate the nonconforming problem, and construct an equivalent representation in terms of the nodal values. In Section 3, we construct an isomorphism between the nonconforming space and a space of continuous piecewise linear functions. The isomorphism is used in Section 4 to analyze the rate of convergence of the Dryja-Widlund additive Schwarz method. In the last three sections, we consider the substructuring methods of Smith and Mandel applied to the nonconforming problem.


PRELIMINARIES 


We consider the following self-adjoint, uniformly elliptic problem for $p$ on the polygonal domain $\Omega \subset \mathbb{R}^n$, $n = 2, 3$, with boundary $\partial\Omega$:

$$ -\nabla\cdot(A\nabla p) = f \ \text{ in } \Omega, \qquad p = 0 \ \text{ on } \partial\Omega, \tag{1} $$

where $A$ is a uniformly positive definite, bounded, symmetric second order tensor, and $f \in L^2(\Omega)$. The uniform ellipticity of (1) implies the existence of positive constants $c_*, c^*$ such that the following bound holds:

$$ c_*\,\xi^T\xi \le \xi^T A(x)\,\xi \le c^*\,\xi^T\xi \qquad \forall \xi \in \mathbb{R}^n,\ \forall x \in \Omega. \tag{2} $$


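In order to set a length scale, we assume that the diameter of $\Omega$ is one. We introduce a two-level quasi-regular triangulation of $\Omega$: a division first into subdomains $\{\Omega_i\}_{i=1}^M$ with diameter $O(H)$, and a refinement of the first into elements with diameter $O(h)$, which we denote by $\mathcal{T}$. On each subdomain we use the scaled norms

$$ \|u\|_{1,\Omega_i}^2 = |u|_{1,\Omega_i}^2 + \frac{1}{H^2}\|u\|_{0,\Omega_i}^2, \qquad \|u\|_{1/2,\partial\Omega_i}^2 = |u|_{1/2,\partial\Omega_i}^2 + \frac{1}{H}\|u\|_{0,\partial\Omega_i}^2, $$

where

$$ \|u\|_{0,\Omega_i}^2 = \int_{\Omega_i} |u(x)|^2\,dx, \qquad \|u\|_{0,\partial\Omega_i}^2 = \int_{\partial\Omega_i} |u(x)|^2\,dx, \qquad |u|_{1,\Omega_i}^2 = \int_{\Omega_i} |\nabla u(x)|^2\,dx, $$

$$ |u|_{1/2,\partial\Omega_i}^2 = \int_{\partial\Omega_i}\int_{\partial\Omega_i} \frac{|u(t) - u(s)|^2}{|t - s|^n}\,dt\,ds. $$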


Let $\mathcal{N}(\Omega)$ be a finite dimensional nonconforming finite element space of Lagrange-type defined subordinate to the triangulation $\mathcal{T}$ that vanishes at all degrees of freedom on $\partial\Omega$. Since $\mathcal{N}(\Omega)$ is of Lagrange-type, the elements of $\mathcal{N}(\Omega)$ may be expressed in terms of a nodal basis, and we may identify an element of $\mathcal{N}(\Omega)$ with the values it attains at the nodal points. For convenience, we assume that the subdomains and the elements are triangular in two dimensions or tetrahedral in three dimensions. Extensions to other shape-regular decompositions are straightforward.

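We consider the problem of finding $p_h \in \mathcal{N}(\Omega)$ such that

$$ d(p_h, q_h) = \int_\Omega f q_h\,dx \qquad \forall q_h \in \mathcal{N}(\Omega), \tag{3} $$

where $d$ is the generalized Dirichlet form:

$$ d(p_h, q_h) = d_\Omega(p_h, q_h), \qquad d_{\Omega'}(p_h, q_h) = \sum_{\tau \in \mathcal{T},\ \tau \subset \Omega'} \int_\tau A\nabla p_h \cdot \nabla q_h\,dx. $$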

We now introduce several conventions used in this paper. We shall only be concerned with the solution of this finite dimensional problem, and will henceforth drop the subscript $h$.

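Having defined a parent finite element space of functions $\mathcal{X}(\Omega)$ with a nodal basis and a set $\Omega' \subset \Omega$, we will simply write $\mathcal{X}(\Omega')$ for the restriction of $\mathcal{X}(\Omega)$ to $\Omega'$, i.e.

$$ \mathcal{X}(\Omega') = \{\phi|_{\Omega'} \mid \phi \in \mathcal{X}(\Omega)\}. $$

By an abuse of notation, we also consider an element $\phi \in \mathcal{X}(\Omega')$ to be an element of $\mathcal{X}(\Omega)$ by setting $\phi$ to zero at all nodes outside of $\Omega'$.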

We will write $Q_1 \sim Q_2$ if two quadratic forms $Q_1$ and $Q_2$ with the same domain $\mathcal{D}$ are equivalent, i.e. if there exist constants $c_1, c_2 > 0$ such that

$$ c_1 Q_1(\phi, \phi) \le Q_2(\phi, \phi) \le c_2 Q_1(\phi, \phi) \qquad \forall \phi \in \mathcal{D}. $$


In what follows, $C$ will be used to denote a generic constant that may not be the same from one line to the next. This constant, as well as the constants involved in the equivalence of quadratic forms, will always be independent of $h$ and $H$, but can depend on the constants in (2), the shape regularity of the subdomains, the degree of the nonconforming finite elements, and the regularity of the triangulation.

To conclude this section, we prove a lemma that provides an equivalent quadratic form for $d(\cdot, \cdot)$ in terms of the nodal degrees of freedom. The proof of this lemma was suggested by Joseph Pasciak in the context of the mixed finite element methods considered in [6, 7].

Lemma 1 Let $\Omega' \subset \Omega$ be the union of elements of $\mathcal{T}$, and let $A(x) = \alpha(x)\tilde{A}(x)$, where $\alpha$ is a positive, piecewise constant function with value $\alpha_\tau$ on $\tau \in \mathcal{T}$. Then for every $p \in \mathcal{N}(\Omega')$,

$$ d_{\Omega'}(p, p) \sim \sum_{\tau \in \mathcal{T},\ \tau \subset \Omega'} \alpha_\tau |\tau|^{1 - 2/n} \sum_{\text{nodes } n_i, n_j \in \tau} \big(p(n_i) - p(n_j)\big)^2. \tag{4} $$

The constants that appear in the definition of the equivalence do not depend on the constants in (2), but rather on the constants that arise when $A$ is replaced by $\tilde{A}$.


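Proof. The local kernel of $d_\tau(\cdot, \cdot)$ in $\mathcal{N}(\tau)$ is exactly the constant functions on $\tau$, since for $p \in \mathcal{N}(\tau)$

$$ d_\tau(p, q) = 0 \quad \forall q \in \mathcal{N}(\tau) \iff \nabla p = 0. $$

Hence, $(d_\tau(\cdot, \cdot))^{1/2}$ is a norm on $\mathcal{N}(\tau)/\mathbb{R}$. The square root of the sum of squared nodal differences is also a norm on $\mathcal{N}(\tau)/\mathbb{R}$, since it vanishes exactly when $p$ takes the same value at all nodes of $\tau$, i.e. when $p$ is constant. Since all norms are equivalent on finite dimensional spaces, we see that

$$ d_\tau(p, p) \sim \alpha_\tau |\tau|^{1 - 2/n} \sum_{\text{nodes } n_i, n_j \in \tau} \big(p(n_i) - p(n_j)\big)^2 $$

by a simple scaling argument: mapping $\tau$ to a reference element multiplies $d_\tau(p, p)$ by a factor comparable to $|\tau|^{1 - 2/n}$ and leaves the nodal differences unchanged. The proof is completed by summing over the elements of $\mathcal{T}$ in $\Omega'$. □

Figure 1: Refinement of the 2D P-1 element and a partial refinement of the 3D P-1 element.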
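The scaling argument in the proof of Lemma 1 is easy to check numerically in two dimensions, where the weight $|\tau|^{1-2/n}$ equals one. The sketch below assumes $A = I$ (so $\alpha_\tau = 1$) and the P-1 nonconforming element with degrees of freedom at the edge midpoints. It evaluates $d_\tau(p,p)$ for the linear function determined by three midpoint values, shows that constants lie in the kernel, and shows that the energy is unchanged when the triangle is scaled and translated, so the equivalence constants in (4) depend only on the shape of $\tau$.

```python
# Lemma 1 for one triangle (n = 2, so |tau|^(1 - 2/n) = 1), assuming A = identity
# and the P-1 nonconforming element with degrees of freedom at edge midpoints.


def cr_energy(verts, vals):
    """d_tau(p, p) = |tau| * |grad p|^2 for the linear p with the given values at
    the midpoints of the edges opposite verts[0], verts[1], verts[2]."""
    (x0, y0), (x1, y1), (x2, y2) = verts
    mids = [((x1 + x2) / 2, (y1 + y2) / 2),
            ((x0 + x2) / 2, (y0 + y2) / 2),
            ((x0 + x1) / 2, (y0 + y1) / 2)]
    ax, ay = mids[1][0] - mids[0][0], mids[1][1] - mids[0][1]
    bx, by = mids[2][0] - mids[0][0], mids[2][1] - mids[0][1]
    r1, r2 = vals[1] - vals[0], vals[2] - vals[0]
    det = ax * by - ay * bx
    gx = (r1 * by - ay * r2) / det      # Cramer's rule for the 2x2 gradient system
    gy = (ax * r2 - bx * r1) / det
    area = abs((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)) / 2
    return area * (gx * gx + gy * gy)


def node_difference_sum(vals):
    """Inner sum of (4) for one element with alpha_tau = 1 (unordered pairs)."""
    return sum((vals[i] - vals[j]) ** 2 for i in range(3) for j in range(i + 1, 3))


tri = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
vals = [0.5, 0.0, 0.5]                  # p(x, y) = x sampled at the edge midpoints
print(cr_energy(tri, vals), node_difference_sum(vals))            # 0.5 0.5
big = [(3.0 * x + 7.0, 3.0 * y - 2.0) for x, y in tri]            # scaled and shifted copy
print(abs(cr_energy(big, vals) - cr_energy(tri, vals)) < 1e-12)   # True
print(cr_energy(tri, [2.0, 2.0, 2.0]))                            # 0.0
```

In three dimensions the same experiment would show the energy scaling like the linear size of $\tau$, which is the factor $|\tau|^{1-2/n} = |\tau|^{1/3}$ in (4).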


A CONFORMING EQUIVALENCE 


In this section, we construct a conforming space that is isomorphic to $\mathcal{N}(\Omega)$ using the techniques in [6, 7] and recall some basic properties of the isomorphism.

Given an element $\tau \in \mathcal{T}$, let $\mathcal{T}_\tau$ be a subtriangulation of $\tau$ such that the vertices of the subtriangulation include the vertices of $\tau$ and the nodal points in $\tau$ pertaining to the degrees of freedom of $\mathcal{N}(\tau)$. Every element in the new triangulation should have at least one vertex that corresponds to a nodal point of $\mathcal{N}(\tau)$. Moreover, the subtriangulations should be constructed in such a way that their union gives rise to a refined quasi-regular triangulation of $\Omega$, which we denote by

$$ \tilde{\mathcal{T}} = \bigcup_{\tau \in \mathcal{T}} \mathcal{T}_\tau. $$

A vertex of $\tilde{\mathcal{T}}$ will be called primary if it is a nodal point corresponding to a degree of freedom of $\mathcal{N}(\Omega)$; otherwise, we call the vertex secondary. We say that two vertices of the triangulation $\tilde{\mathcal{T}}$ are adjacent if there exists an edge of $\tilde{\mathcal{T}}$ connecting them. An example of the subtriangulation of the P-1 element that has nodal degrees of freedom at the centers of its edges (faces) is given in Figure 1.
Let $U_h(\Omega)$ denote the space of continuous piecewise linear functions subordinate to the triangulation $\tilde{\mathcal{T}}$ that vanish on $\partial\Omega$. For $\Omega' \subset \Omega$, a union of elements, define $U_h(\Omega')$ by restriction, i.e.

$$ U_h(\Omega') = \{u|_{\Omega'} \mid u \in U_h(\Omega)\}. $$

Since the functions in $U_h(\Omega')$ are naturally parameterized by the values they attain at the vertices, we can define a pseudo-interpolation operator $\mathcal{I}^{\Omega'}$ into $U_h(\Omega')$ for any function $\phi$ defined at the primary vertices contained in $\Omega'$ by

$$ (\mathcal{I}^{\Omega'}\phi)(x) = \begin{cases} 0, & \text{if } x \in \partial\Omega' \cap \partial\Omega; \\ \phi(x), & \text{if } x \text{ is a primary vertex not in } \partial\Omega' \cap \partial\Omega; \\ \text{the average of } \phi \text{ at all adjacent primary vertices on } \partial\Omega', & \text{if } x \text{ is a secondary vertex in } \partial\Omega' \setminus \partial\Omega; \\ \text{the average of } \phi \text{ at all adjacent primary vertices}, & \text{if } x \text{ is a secondary vertex in the interior of } \Omega'; \\ \text{the continuous piecewise linear interpolant of the above vertex values}, & \text{if } x \text{ is not a vertex of } \tilde{\mathcal{T}}. \end{cases} \tag{5} $$

Since $\mathcal{I}^{\Omega'}$ is well defined for any function defined at the primary vertices, by an abuse of notation, we can understand $\mathcal{I}^{\Omega'}$ both as a map from $\mathcal{N}(\Omega')$ into $U_h(\Omega')$ and as a map from $U_h(\Omega')$ into itself. For any $\Omega'$ that is the union of elements in $\mathcal{T}$, let $\tilde{U}_h(\Omega') \subset U_h(\Omega')$ denote the range of $\mathcal{I}^{\Omega'}$; that is,

$$ \tilde{U}_h(\Omega') = \{\phi = \mathcal{I}^{\Omega'}q \mid q \in \mathcal{N}(\Omega')\}. $$
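As an illustration of (5), the following sketch computes the pseudo-interpolant at the secondary vertices for the P-1 element with midpoint degrees of freedom and the refinement of Figure 1, in which the primary vertices adjacent to an original vertex v are the midpoints of the edges of $\mathcal{T}$ incident to v. The mesh is a hypothetical 2 × 2 patch of split squares, and the patch $\Omega'$ is assumed not to touch $\partial\Omega$, so the zero case of (5) is never used.

```python
from itertools import combinations


def grid_mesh(m=2):
    """Hypothetical m x m patch of unit squares, each split along the same diagonal."""
    tris = []
    for i in range(m):
        for j in range(m):
            a, b, c, d = (i, j), (i + 1, j), (i + 1, j + 1), (i, j + 1)
            tris += [(a, b, c), (a, c, d)]
    return tris


def edge_counts(tris):
    """Map each edge (sorted vertex pair) of T to the number of triangles containing it."""
    count = {}
    for t in tris:
        for e in combinations(sorted(t), 2):
            count[e] = count.get(e, 0) + 1
    return count


def midpoint(e):
    (x0, y0), (x1, y1) = e
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def pseudo_interpolate(tris, primary):
    """Values of the pseudo-interpolant (5) at the secondary vertices, i.e. the
    original triangle vertices. `primary` maps each edge to p at its midpoint."""
    count = edge_counts(tris)
    out = {}
    for v in {v for t in tris for v in t}:
        incident = [e for e in count if v in e]            # midpoints adjacent to v in the refinement
        on_boundary = [e for e in incident if count[e] == 1]
        use = on_boundary if on_boundary else incident     # boundary vertices use boundary midpoints only
        out[v] = sum(primary[e] for e in use) / len(use)
    return out


def f(x, y):
    return 2 * x - y + 1                                  # a linear p, sampled at the midpoints


tris = grid_mesh()
primary = {e: f(*midpoint(e)) for e in edge_counts(tris)}
vals = pseudo_interpolate(tris, primary)
print(vals[(1, 1)], vals[(1, 0)], vals[(0, 0)])           # 2.0 3.0 1.25
```

For this linear p, the average reproduces the exact vertex value at the interior vertex (1, 1) and at the edge vertex (1, 0), but not at the corner (0, 0), where only two boundary midpoints are averaged. Theorem 2 does not need exactness; it only needs each average to be controlled by nearby nodal differences.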

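We now prove that $\mathcal{I}^{\Omega'} : \mathcal{N}(\Omega') \to \tilde{U}_h(\Omega')$ preserves, up to equivalence constants, the norm induced by the bilinear form $d_{\Omega'}(\cdot, \cdot)$ on $\mathcal{N}(\Omega')$ and the $H^1$-seminorm on $\tilde{U}_h(\Omega')$. Since $\mathcal{I}^{\Omega'}$ is a bijection between $\mathcal{N}(\Omega')$ and $\tilde{U}_h(\Omega')$ by construction, this proves that $\mathcal{N}(\Omega')$ and $\tilde{U}_h(\Omega')$ are isomorphic.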

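Theorem 2 Let $\Omega' \subset \Omega$ be the union of elements. Then for all $p \in \mathcal{N}(\Omega')$,

$$ d_{\Omega'}(p, p) \sim |\mathcal{I}^{\Omega'}p|_{1,\Omega'}^2. \tag{6} $$

Proof. This proof is an expanded version of the proof given in [7]. Recall that for $\phi \in U_h(\Omega')$,

$$ |\phi|_{1,\Omega'}^2 \sim \sum_{t \in \tilde{\mathcal{T}},\ t \subset \Omega'} |t|^{1 - 2/n} \sum_{\text{vertices } v_i, v_j \in t} \big(\phi(v_i) - \phi(v_j)\big)^2. \tag{7} $$

By virtue of Lemma 1 and Equation (7), it is enough to show that

$$ \sum_{\tau \in \mathcal{T},\ \tau \subset \Omega'} |\tau|^{1 - 2/n} \sum_{\text{nodes } n_i, n_j \in \tau} \big(p(n_i) - p(n_j)\big)^2 \sim \sum_{t \in \tilde{\mathcal{T}},\ t \subset \Omega'} |t|^{1 - 2/n} \sum_{\text{vertices } v_i, v_j \in t} \big((\mathcal{I}^{\Omega'}p)(v_i) - (\mathcal{I}^{\Omega'}p)(v_j)\big)^2. \tag{8} $$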

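Since the vertices of $\mathcal{T}_\tau$ contain the nodal points of $\tau$ and $p = \mathcal{I}^{\Omega'}p$ at these points, writing each difference $p(n_i) - p(n_j)$ as a telescoping sum along a path of edges of $\mathcal{T}_\tau$ gives

$$ \sum_{\text{nodes } n_i, n_j \in \tau} \big(p(n_i) - p(n_j)\big)^2 \le C \sum_{t \in \mathcal{T}_\tau} \sum_{\text{vertices } v_i, v_j \in t} \big((\mathcal{I}^{\Omega'}p)(v_i) - (\mathcal{I}^{\Omega'}p)(v_j)\big)^2, $$

where the constant is controlled by the regularity of the subtriangulation. Hence, by summing over the elements of $\mathcal{T}$ in $\Omega'$ and using $|t| \sim |\tau|$ for $t \in \mathcal{T}_\tau$, we conclude that the right-hand side of (8) dominates the left-hand side.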

To prove that the left-hand side dominates the right, we note that the differences in the right-hand side are of three types: the difference at two primary vertices, the difference at two secondary vertices, and the difference at a primary and a secondary vertex. Since $p$ and $\mathcal{I}^{\Omega'}p$ agree at primary vertices of $\tilde{\mathcal{T}}$, the difference at two primary vertices occurs as a term in the left-hand side. For two secondary vertices $v_1, v_2$ in an element $t \in \tilde{\mathcal{T}}$ containing a primary vertex $v_p$, we see that

$$ \big((\mathcal{I}^{\Omega'}p)(v_1) - (\mathcal{I}^{\Omega'}p)(v_2)\big)^2 \le 2\big((\mathcal{I}^{\Omega'}p)(v_1) - (\mathcal{I}^{\Omega'}p)(v_p)\big)^2 + 2\big((\mathcal{I}^{\Omega'}p)(v_2) - (\mathcal{I}^{\Omega'}p)(v_p)\big)^2. $$

Hence, it is enough to bound the difference at a secondary and a primary vertex by terms in the left-hand side of (8).

Let $v_{m+1}$ be a secondary vertex with adjacent primary vertices $v_1, \dots, v_m$, and let $p_j = p(v_j)$. Noting that for $j = 1, \dots, m$

$$ (\mathcal{I}^{\Omega'}p)(v_j) = p_j, \qquad (\mathcal{I}^{\Omega'}p)(v_{m+1}) = \frac{1}{m}\sum_{j=1}^m (\mathcal{I}^{\Omega'}p)(v_j) = \frac{1}{m}\sum_{j=1}^m p_j, $$

we see that, for $i = 1, \dots, m$,

$$ \big((\mathcal{I}^{\Omega'}p)(v_{m+1}) - (\mathcal{I}^{\Omega'}p)(v_i)\big)^2 = \Big(\frac{1}{m}\sum_{j=1}^m (p_j - p_i)\Big)^2 \le \frac{1}{m}\sum_{j=1}^m (p_j - p_i)^2 $$

by the Cauchy-Schwarz inequality. The proof is completed by summing over all elements of $\tilde{\mathcal{T}}$. The number of such terms, and hence the constant in the bound, is controlled, since the regularity of the mesh implies that there is an a priori maximum number of adjacent elements that can share a secondary point. □

Using the techniques in the proof of Theorem 2, the following lemma is easy to prove. 


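Lemma 3 There exists a constant $C$, depending only on the regularity of the triangulation $\mathcal{T}$ and the degree of the nonconforming space, such that for any $\Omega' \subset \Omega$ that is the union of elements of $\mathcal{T}$,

$$ |\mathcal{I}^{\Omega'}\phi|_{k,\Omega'} \le C|\phi|_{k,\Omega'} \qquad \forall \phi \in U_h(\Omega'),\ k = 0, 1, \tag{9} $$

where $|\cdot|_{0,\Omega'} = \|\cdot\|_{0,\Omega'}$.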


THE DRYJA-WIDLUND ADDITIVE SCHWARZ METHOD 

The presentation in this section and the next follows the treatment of Schwarz methods given by Dryja and Widlund in [4]. We concentrate only on additive Schwarz methods with exact solves. The convergence rate of the multiplicative Schwarz method may be estimated in terms of the same quantities (see [13]) and is easily worked out. Extensions to inexact solves are likewise direct.

Recall that the additive Schwarz method with exact solves for (3) is completely determined by a decomposition of the finite element space $\mathcal{N}(\Omega) = \mathcal{N}_0 + \mathcal{N}_1 + \dots + \mathcal{N}_M$. For each subspace $\mathcal{N}_i$, define an operator $P_i : \mathcal{N}(\Omega) \to \mathcal{N}_i$ by

$$ d(P_i p, q) = d(p, q) \qquad \forall q \in \mathcal{N}_i. \tag{10} $$
The additive Schwarz algorithm with exact solves for (3) involves the solution of

$$ Pp = \hat{f}, \qquad P = \sum_{i=0}^M P_i, \qquad \hat{f} = \sum_{i=0}^M f_i, \tag{11} $$

where $f_i \in \mathcal{N}_i$ is defined by

$$ d(f_i, q) = \int_\Omega f q\,dx \qquad \forall q \in \mathcal{N}_i. $$
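Definitions (10) and (11) translate directly into matrix form: if the columns of $V_i$ span $\mathcal{N}_i$, then $P_i = V_i(V_i^TAV_i)^{-1}V_i^TA$ and $f_i = V_i(V_i^TAV_i)^{-1}V_i^Tb$, where $A$ is the stiffness matrix and $b$ the load vector, and the right-hand side of the additive Schwarz system is the sum of the $f_i$. The sketch below uses a hypothetical 1D stand-in (A = tridiag(-1, 2, -1), three overlapping index blocks and one coarse "hat" vector) rather than the nonconforming discretization, checks that each $P_i$ is a projection, and checks that solving the additive Schwarz system returns the same solution as solving $Ap = b$.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def matvec(A, x):
    return [dot(row, x) for row in A]


def solve(A, b):
    """Dense Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def correction(A, basis, r):
    """V (V^T A V)^{-1} V^T r for the subspace spanned by `basis`.
    With r = A p this is P_i p from (10); with r = b it is f_i."""
    G = [[dot(vi, matvec(A, vj)) for vj in basis] for vi in basis]
    y = solve(G, [dot(v, r) for v in basis])
    return [sum(y[k] * basis[k][i] for k in range(len(basis))) for i in range(len(r))]


def unit(n, i):
    return [1.0 if j == i else 0.0 for j in range(n)]


n = 9
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
b = [1.0] * n
spaces = [[[min(i + 1, n - i) / 5.0 for i in range(n)]],  # N_0: coarse hat (hypothetical)
          [unit(n, i) for i in range(0, 4)],              # N_1: nodes 0..3
          [unit(n, i) for i in range(3, 7)],              # N_2: nodes 3..6
          [unit(n, i) for i in range(5, 9)]]              # N_3: nodes 5..8


def apply_P(x):
    Ax = matvec(A, x)
    return [sum(c) for c in zip(*(correction(A, V, Ax) for V in spaces))]


P = [list(row) for row in zip(*(apply_P(unit(n, j)) for j in range(n)))]  # P[i][j] = (P e_j)_i
f_hat = [sum(c) for c in zip(*(correction(A, V, b) for V in spaces))]
p_as = solve(P, f_hat)
p_direct = solve(A, b)
print(max(abs(x - y) for x, y in zip(p_as, p_direct)) < 1e-10)  # True
```

In practice $P$ is never formed; it is applied inside a Krylov method such as conjugate gradients in the $d$-inner product, where its condition number governs convergence.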

Abstract bounds on the condition number of $P$ have been derived in terms of two quantities, $C_0$ and the spectral radius of $\mathcal{E}$, which we now define. Let $C_0$ be a constant such that for every $p \in \mathcal{N}(\Omega)$ there exists a representation $p = \sum_{i=0}^M p_i$ with $p_i \in \mathcal{N}_i$ satisfying

$$ \sum_{i=0}^M d(p_i, p_i) \le C_0\, d(p, p). \tag{12} $$

Let $\rho(\mathcal{E})$ denote the spectral radius of $\mathcal{E} = \{\varepsilon_{ij}\}$, the matrix of strengthened Cauchy-Schwarz constants; that is, $\varepsilon_{ij}$ is the smallest constant for which

$$ |d(p_i, p_j)| \le \varepsilon_{ij}\, d(p_i, p_i)^{1/2}\, d(p_j, p_j)^{1/2} \qquad \forall p_i \in \mathcal{N}_i,\ \forall p_j \in \mathcal{N}_j,\ i, j \ge 1. \tag{13} $$

The next theorem, due to Dryja and Widlund [14], bounds the condition number of the additive Schwarz method in terms of $C_0$ and $\rho(\mathcal{E})$.

Theorem 4 The eigenvalues and the condition number $\kappa(P)$ of $P$ satisfy

$$ \lambda_{\min}(P) \ge C_0^{-1}, \qquad \lambda_{\max}(P) \le \rho(\mathcal{E}) + 1, \qquad \kappa(P) \le C_0\,(\rho(\mathcal{E}) + 1). \tag{14} $$

To construct the decomposition of $\mathcal{N}(\Omega)$ to be used in our application of the additive Schwarz algorithm for nonconforming elements, we first create an overlapping decomposition of the domain $\Omega$ by extending each subdomain $\Omega_i$ to a larger region $\Omega_i'$, which is also the union of elements of $\mathcal{T}$. We characterize the extent of the overlap of the partition $\{\Omega_i'\}_{i=1}^M$ by $\delta$, where

$$ \delta = \min_i \operatorname{dist}(\partial\Omega_i \setminus \partial\Omega,\ \partial\Omega_i' \setminus \partial\Omega). $$

The decomposition $\{\Omega_i'\}_{i=1}^M$ gives rise to a natural decomposition of $\mathcal{N}(\Omega)$ by letting $\mathcal{N}_i \subset \mathcal{N}(\Omega)$ denote the set of functions that vanish at all nodes in the closure of $\Omega \setminus \Omega_i'$. In order to provide a mechanism for global exchange of information between subdomains and thereby enhance the rate of convergence, we also use a low dimensional space defined by

$$ \mathcal{N}_0 = \{p \in \mathcal{N}(\Omega) \mid p = \mathcal{I}^{\mathcal{N}}\phi,\ \phi \in U_H(\Omega)\}, $$

where $\mathcal{I}^{\mathcal{N}}$ is nodal interpolation into $\mathcal{N}(\Omega)$, and $U_H(\Omega)$ is the space of continuous functions that are linear on each subdomain $\Omega_i$. Note that the subspaces for the nonconforming space are exactly the nodal interpolants of the standard decomposition of the conforming space $U_h(\Omega)$, namely,

$$ \mathcal{N}_i = \mathcal{I}^{\mathcal{N}}\big(U_h(\Omega) \cap H_0^1(\Omega_i')\big). $$

In the following lemma we recall the crux of the proof due to Dryja and Widlund (Theorem 3 of [4]) that the Schwarz method applied to the conforming Galerkin discretization has a condition number that is $O(1 + H/\delta)$.
Lemma 5 For every $\phi \in U_h(\Omega)$, there exists a decomposition $\phi = \sum_{i=0}^M \phi_i$ with $\phi_0 \in U_H(\Omega)$, $\phi_i \in U_h(\Omega) \cap H_0^1(\Omega_i')$, $1 \le i \le M$, and a constant $C$ independent of $h$, $H$, and $\delta$, such that

$$ \sum_{i=0}^M |\phi_i|_{1,\Omega}^2 \le C\Big(1 + \frac{H}{\delta}\Big)|\phi|_{1,\Omega}^2. \tag{15} $$

We now show that the application of the Schwarz method to the nonconforming space converges 
at the same rate. 

Theorem 6 The condition number $\kappa(P)$ of the additive Schwarz operator $P$ defined by (11), induced by the decomposition $\mathcal{N}(\Omega) = \mathcal{N}_0 + \dots + \mathcal{N}_M$ of the nonconforming finite element space, satisfies

$$ \kappa(P) \le C\Big(1 + \frac{H}{\delta}\Big). $$

The constant $C$ is independent of $h$, $\delta$, and $H$.

Proof. The verification that the largest eigenvalue of $P$ is bounded by a constant is standard. Since $d(p_i, p_j) = 0$ for $p_i \in \mathcal{N}_i$, $p_j \in \mathcal{N}_j$ with $\Omega_i' \cap \Omega_j' = \emptyset$, $P$ may be written as the sum of an a priori bounded number of disjoint projections. Since projections have unit norm, a constant bound on the largest eigenvalue of $P$ is immediate; see, e.g., Lemma 3.1 of [2].

For $p \in \mathcal{N}(\Omega)$, let $(\mathcal{I}^\Omega p)_i$ denote the decomposition of $\mathcal{I}^\Omega p \in U_h(\Omega)$ arising in Lemma 5, and set $p_i = \mathcal{I}^{\mathcal{N}}\big((\mathcal{I}^\Omega p)_i\big)$. It is easy to check that $p_i \in \mathcal{N}_i$ and $p = \sum_{i=0}^M p_i$. Using Theorem 2 and Lemma 3, we see that for $i = 0, \dots, M$,

$$ d(p_i, p_i) \le C\big|\mathcal{I}^\Omega\big((\mathcal{I}^\Omega p)_i\big)\big|_{1,\Omega}^2 \le C|(\mathcal{I}^\Omega p)_i|_{1,\Omega}^2. $$

Summing and applying Lemma 5 and Theorem 2, we conclude that

$$ \sum_{i=0}^M d(p_i, p_i) \le C\sum_{i=0}^M |(\mathcal{I}^\Omega p)_i|_{1,\Omega}^2 \le C\Big(1 + \frac{H}{\delta}\Big)|\mathcal{I}^\Omega p|_{1,\Omega}^2 \le C\Big(1 + \frac{H}{\delta}\Big)d(p, p). $$

Hence, $C_0$ in (12) is bounded by $C(1 + H/\delta)$. An application of Theorem 4 completes the proof. □
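The bound on the largest eigenvalue in the proof above rests on grouping the subspaces into a bounded number of classes whose members are mutually $d$-orthogonal; each class contributes a projection, so $\lambda_{\max}(P)$ is at most the number of classes. The sketch below checks this on a hypothetical 1D model (A = tridiag(-1, 2, -1), not the nonconforming discretization) with three classes, {N_0}, {N_1, N_3}, and {N_2}, estimating $\lambda_{\max}(P)$ by power iteration in the A-inner product. Because P is self-adjoint in that inner product, every Rayleigh quotient is a lower bound for $\lambda_{\max}(P)$, and starting in N_1 keeps the estimate at least 1.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def matvec(A, x):
    return [dot(row, x) for row in A]


def solve(A, b):
    """Dense Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def correction(A, basis, r):
    """V (V^T A V)^{-1} V^T r for the subspace spanned by `basis`."""
    G = [[dot(vi, matvec(A, vj)) for vj in basis] for vi in basis]
    y = solve(G, [dot(v, r) for v in basis])
    return [sum(y[k] * basis[k][i] for k in range(len(basis))) for i in range(len(r))]


def unit(n, i):
    return [1.0 if j == i else 0.0 for j in range(n)]


n = 11
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
spaces = [[[min(i + 1, n - i) / 6.0 for i in range(n)]],  # N_0: coarse hat (hypothetical)
          [unit(n, i) for i in range(0, 4)],              # N_1: nodes 0..3
          [unit(n, i) for i in range(3, 8)],              # N_2: nodes 3..7
          [unit(n, i) for i in range(7, 11)]]             # N_3: nodes 7..10, d-orthogonal to N_1
colors = 3                                                # {N_0}, {N_1, N_3}, {N_2}


def apply_P(x):
    Ax = matvec(A, x)
    return [sum(c) for c in zip(*(correction(A, V, Ax) for V in spaces))]


x = [1.0, 2.0, 1.0, 0.5] + [0.0] * (n - 4)   # start in N_1, so every estimate is >= 1
lam = 0.0
for _ in range(200):
    y = apply_P(x)
    lam = dot(matvec(A, y), x) / dot(matvec(A, x), x)  # A-Rayleigh quotient <= lambda_max
    scale = dot(y, matvec(A, y)) ** 0.5
    x = [v / scale for v in y]
print(round(lam, 4), lam <= colors)                    # estimate of lambda_max(P), True
```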

SUBSTRUCTURING DOMAIN DECOMPOSITION 

The remaining two methods considered in this paper are domain decomposition methods applied to a reduced problem involving only the degrees of freedom on the internal interfaces of the subdomains, $\Gamma = \bigcup_i \partial\Omega_i \setminus \partial\Omega$. Following [4], we recall the construction of the reduced problem. Since $\mathcal{N}(\Omega)$ is of Lagrange-type, we may associate with functions $p, q \in \mathcal{N}(\Omega)$ the vectors of values they attain at the nodes. Let $x$ and $y$ denote the vectors of nodal values of $p$ and $q$, respectively, and $x^{(i)}$, $y^{(i)}$ the subvectors of degrees of freedom in $\overline{\Omega}_i$. Let $D^{(i)}$ denote the local stiffness matrix arising from $d_{\Omega_i}(\cdot, \cdot)$, and let $D$ denote the global stiffness matrix, i.e.

$$ x^{(i)T}D^{(i)}y^{(i)} = d_{\Omega_i}(p, q), \qquad x^TDy = \sum_{i=1}^M x^{(i)T}D^{(i)}y^{(i)} = d(p, q). $$
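For each subdomain, we can partition the degrees of freedom into two sets: those related to nodes on the boundary of $\Omega_i$, denoted $x_B^{(i)}$, and those corresponding to nodes in the interior of $\Omega_i$, denoted $x_I^{(i)}$. Such a partitioning induces a partitioning of $D^{(i)}$ given by

$$ x^{(i)T}D^{(i)}y^{(i)} = \begin{pmatrix} x_I^{(i)} \\ x_B^{(i)} \end{pmatrix}^T \begin{pmatrix} D_{II}^{(i)} & D_{IB}^{(i)} \\ D_{BI}^{(i)} & D_{BB}^{(i)} \end{pmatrix} \begin{pmatrix} y_I^{(i)} \\ y_B^{(i)} \end{pmatrix}. $$

The interior unknowns of each subdomain may be eliminated in terms of the boundary unknowns. The resulting matrix $S$ is the Schur complement with respect to the interface unknowns, defined by

$$ x_B^TSy_B = \sum_{i=1}^M x_B^{(i)T}S^{(i)}y_B^{(i)}, \qquad \text{where} \qquad S^{(i)} = D_{BB}^{(i)} - D_{BI}^{(i)}\big(D_{II}^{(i)}\big)^{-1}D_{IB}^{(i)}. $$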


It will be convenient to work with the bilinear forms induced by $S$ and $S^{(i)}$, and so we define

$$ s(p, q) = x_B^TSy_B, \qquad s_i(p, q) = x_B^{(i)T}S^{(i)}y_B^{(i)}. $$

For a function $p \in \mathcal{N}(\Omega)$, we note that, unlike for conforming spaces, the restriction of $p$ to the interfaces, $p|_\Gamma$, is not solely determined by the nodal values on $\Gamma$, since $\mathcal{N}(\Omega)$ is nonconforming. Hence, we are careful to understand $\mathcal{N}(\Gamma)$ as a subset of $\mathcal{N}(\Omega)$, parameterized by the nodal values on $\Gamma$, consisting of the discrete harmonic extensions of the nodal values to the interiors of the subdomains. Specifically, if $p \in \mathcal{N}(\Gamma)$ has the vector of nodal values $x_B^{(i)}$ on $\partial\Omega_i$, then $p|_{\Omega_i}$ is the function associated with the vector of nodal values $(x_I^{(i)}, x_B^{(i)})^T$, where $x_I^{(i)}$ satisfies

$$ D_{II}^{(i)}x_I^{(i)} + D_{IB}^{(i)}x_B^{(i)} = 0. $$

A linear functional $g$ is easily constructed such that finding $p \in \mathcal{N}(\Gamma)$ satisfying

$$ s(p, q) = g(q) \qquad \forall q \in \mathcal{N}(\Gamma) \tag{16} $$

is essentially equivalent to (3).
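The Schur complement and the discrete harmonic extension can be illustrated on a single hypothetical subdomain: the 1D stiffness matrix of four linear elements (h = 1, no Dirichlet condition) with boundary nodes 0 and 4. The sketch computes $S^{(i)} = D_{BB} - D_{BI}D_{II}^{-1}D_{IB}$, builds the discrete harmonic extension by solving $D_{II}x_I = -D_{IB}x_B$, and checks that $x_B^TSx_B$ equals the energy of that extension, while another extension with the same boundary values has larger energy. This minimization property is what identifies $s_i(p,p)$ with an infimum of $d_{\Omega_i}$ over extensions.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def solve(A, b):
    """Dense Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def schur_complement(D, interior, boundary):
    """S = D_BB - D_BI D_II^{-1} D_IB for the given index sets."""
    D_II = [[D[i][j] for j in interior] for i in interior]
    S = []
    for bi in boundary:
        row = []
        for bj in boundary:
            w = solve(D_II, [D[i][bj] for i in interior])     # D_II^{-1} D_IB e_bj
            row.append(D[bi][bj] - sum(D[bi][i] * wi for i, wi in zip(interior, w)))
        S.append(row)
    return S


def harmonic_extension(D, interior, boundary, x_b):
    """Full nodal vector whose interior part solves D_II x_I = -D_IB x_B."""
    D_II = [[D[i][j] for j in interior] for i in interior]
    rhs = [-sum(D[i][b] * xb for b, xb in zip(boundary, x_b)) for i in interior]
    x = [0.0] * len(D)
    for i, xi in zip(interior, solve(D_II, rhs)):
        x[i] = xi
    for b, xb in zip(boundary, x_b):
        x[b] = xb
    return x


def energy(D, x):
    return dot(x, [dot(row, x) for row in D])


D = [[1.0, -1.0, 0.0, 0.0, 0.0],
     [-1.0, 2.0, -1.0, 0.0, 0.0],
     [0.0, -1.0, 2.0, -1.0, 0.0],
     [0.0, 0.0, -1.0, 2.0, -1.0],
     [0.0, 0.0, 0.0, -1.0, 1.0]]
interior, boundary = [1, 2, 3], [0, 4]
S = schur_complement(D, interior, boundary)
x = harmonic_extension(D, interior, boundary, [0.0, 1.0])
print(S)                                          # approximately [[0.25, -0.25], [-0.25, 0.25]]
print(energy(D, x))                               # approximately 0.25 = x_B^T S x_B
print(energy(D, [0.0, 0.5, 0.5, 0.5, 1.0]))       # 0.5: a non-harmonic extension costs more
```

For this 1D matrix the discrete harmonic extension is simply linear interpolation between the boundary values, which is why $S$ equals one quarter of the two-node stiffness matrix.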

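We now construct a conforming space of functions that is isomorphic to $\mathcal{N}(\Gamma)$ with the norm induced by the bilinear form $s(\cdot, \cdot)$. Let $U_h(\Gamma)$ denote the restriction of $U_h(\Omega)$ to $\bigcup_i \partial\Omega_i$. Since functions in $U_h(\Gamma)$ vanish on $\partial\Omega$ (because functions in $U_h(\Omega)$ do), functions in $U_h(\Gamma)$ can be parameterized in the natural nodal basis by the values they attain at the vertices of $\tilde{\mathcal{T}}$ in $\Gamma$. Analogous to (5), for $\Gamma'$ the union of edges (and faces in 3D) in the triangulation $\mathcal{T}$ and $\phi$ a function defined at the primary vertices in $\Gamma'$, define a pseudo-interpolant $\mathcal{I}^{\Gamma'}\phi \in U_h(\Gamma')$ by

$$ (\mathcal{I}^{\Gamma'}\phi)(x) = \begin{cases} 0, & \text{if } x \in \Gamma' \cap \partial\Omega; \\ \phi(x), & \text{if } x \text{ is a primary vertex not in } \Gamma' \cap \partial\Omega; \\ \text{the average of } \phi \text{ at all adjacent primary vertices on } \Gamma', & \text{if } x \text{ is a secondary vertex on } \Gamma'; \\ \text{the continuous piecewise linear interpolant of the above vertex values}, & \text{if } x \text{ is not a vertex of } \tilde{\mathcal{T}}. \end{cases} \tag{17} $$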
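Note that if $\Gamma' = \partial\Omega'$, then $\mathcal{I}^{\Gamma'}\phi = (\mathcal{I}^{\Omega'}\tilde{\phi})|_{\partial\Omega'}$ for all $\tilde{\phi}$ in $\mathcal{N}(\Omega')$ that agree with $\phi$ at the nodal degrees of freedom of $\partial\Omega'$.

Since $\mathcal{I}^{\Gamma'}$ is well defined for any function defined at primary vertices, by an abuse of notation, we can understand $\mathcal{I}^{\Gamma'}$ both as a map from $\mathcal{N}(\Gamma')$ into $U_h(\Gamma')$ and as a map from $U_h(\Gamma')$ into $U_h(\Gamma')$. We denote the range of $\mathcal{I}^{\Gamma'}$ by

$$ \tilde{U}_h(\Gamma') = \{\mathcal{I}^{\Gamma'}\phi \mid \phi \in U_h(\Gamma')\}. $$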

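The equivalences in the following lemma are a combination of the standard trace theorem and an extension theorem for $\tilde{U}_h(\partial\Omega_i)$. In particular, the proof of this lemma given in [6] shows that the space $\tilde{U}_h(\Omega_i)$ is rich enough to inherit the Extension Theorem of Widlund [15] from $U_h(\Omega_i)$.

Lemma 7 For $\phi \in \tilde{U}_h(\partial\Omega_i)$,

$$ \|\phi\|_{1/2,\partial\Omega_i}^2 \sim \inf_{\substack{\tilde{\phi} \in \tilde{U}_h(\Omega_i) \\ \tilde{\phi}|_{\partial\Omega_i} = \phi}} \|\tilde{\phi}\|_{1,\Omega_i}^2, \qquad |\phi|_{1/2,\partial\Omega_i}^2 \sim \inf_{\substack{\tilde{\phi} \in \tilde{U}_h(\Omega_i) \\ \tilde{\phi}|_{\partial\Omega_i} = \phi}} |\tilde{\phi}|_{1,\Omega_i}^2. \tag{18} $$

Additionally, there exists a constant $C$ independent of mesh parameters such that

$$ |\mathcal{I}^{\partial\Omega_i}\phi|_{k,\partial\Omega_i} \le C|\phi|_{k,\partial\Omega_i} \qquad \forall \phi \in U_h(\partial\Omega_i),\ k = 0, 1/2. \tag{19} $$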

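The following theorem plays the role of Theorem 2 for the interface problem.

Theorem 8 For all $p \in \mathcal{N}(\Gamma)$ and $i = 1, \dots, M$,

$$ s_i(p, p) \sim |\mathcal{I}^{\partial\Omega_i}p|_{1/2,\partial\Omega_i}^2. \tag{20} $$

Proof. By a direct computation (the discrete harmonic extension minimizes $d_{\Omega_i}$ among all extensions with the same nodal values on $\partial\Omega_i$), followed by an application of Theorem 2 and Lemma 7, noting that $\tilde{U}_h(\Omega_i) = \mathcal{I}^{\Omega_i}(\mathcal{N}(\Omega_i))$, we see that

$$ s_i(p, p) = \inf_{\substack{\tilde{p} \in \mathcal{N}(\Omega_i) \\ \tilde{p}|_{\partial\Omega_i} = p}} d_{\Omega_i}(\tilde{p}, \tilde{p}) \sim \inf_{\substack{\tilde{p} \in \mathcal{N}(\Omega_i) \\ \tilde{p}|_{\partial\Omega_i} = p}} |\mathcal{I}^{\Omega_i}\tilde{p}|_{1,\Omega_i}^2 \sim |\mathcal{I}^{\partial\Omega_i}p|_{1/2,\partial\Omega_i}^2, $$

where $\tilde{p}|_{\partial\Omega_i} = p$ means agreement at the nodal degrees of freedom on $\partial\Omega_i$. □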


SMITH’S VERTEX SPACE METHOD 

Smith's vertex space method [2] is an additive Schwarz method applied to the interface problem (16). The decomposition of $\mathcal{N}(\Gamma)$ is constructed slightly differently in two and three dimensions. In both cases, we first partition $\Gamma$ into overlapping subsets based on its decomposition as the boundary of subdomains. In two dimensions, for each vertex $V_j$ of $\Gamma$, let $\Gamma^{V_j}$ denote the set of points on $\Gamma$ that are less than a distance $\delta$ from $V_j$. For each edge $E_i$ of $\Gamma$, let $\Gamma^{E_i}$ denote the interior of the edge $E_i$. In three dimensions, for each vertex $V_j$, each edge $E_i$, and each face $F_k$ of $\Gamma$, define $\Gamma^{V_j}$ as above, let $\Gamma^{F_k}$ denote the interior of the face $F_k$, and let $\Gamma^{E_i}$ denote the set of all points in strips of width $\delta$ on all faces that share the common edge $E_i$.

Understanding the set of faces to be empty in two dimensions, the decomposition of $\Gamma$ into subsets induces a decomposition of $\mathcal{N}(\Gamma)$ by considering

$$ \mathcal{N}(\Gamma) = \sum_{G \in \{H, E_i, V_j, F_k\}} \mathcal{N}(\Gamma^G), $$

where for $G \in \{E_i, V_j, F_k\}$, $\mathcal{N}(\Gamma^G) \subset \mathcal{N}(\Gamma)$ are those functions that vanish at all nodal points on $\Gamma$ that are outside of the set $\Gamma^G$, and $\mathcal{N}(\Gamma^H) \subset \mathcal{N}(\Gamma)$ are those functions that are the nodal interpolants of the restriction to $\Gamma$ of continuous functions that are linear on each subdomain $\Omega_i$ and vanish on $\partial\Omega$.

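The following lemma is the crux of the analysis of Smith's method by Dryja and Widlund [4] for conforming elements.

Lemma 9 For every $\phi \in U_h(\Gamma)$, there exists a decomposition

$$ \phi = \sum_{G \in \{H, E_i, V_j, F_k\}} \phi_G $$

with $\phi_H \in U_H(\Gamma)$ and, for $G \in \{E_i, V_j, F_k\}$, $\phi_G \in U_h(\Gamma^G)$, the functions in $U_h(\Gamma)$ that vanish on $\Gamma \setminus \Gamma^G$, such that

$$ \sum_{G \in \{H, E_i, V_j, F_k\}} \sum_{i=1}^M |\phi_G|_{1/2,\partial\Omega_i}^2 \le C\big(1 + \log(H/\delta)\big)^2 \sum_{i=1}^M |\phi|_{1/2,\partial\Omega_i}^2. \tag{21} $$

The constant $C$ is independent of the choice of $\phi$ and of the mesh parameters $h$, $H$, and $\delta$.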
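Since the constants $C$ in (15) and (21) are not specified, only growth rates can be compared. The short sketch below tabulates the growth factors $1 + H/\delta$ of Lemma 5 and Theorem 6 and $(1 + \log(H/\delta))^2$ of Lemma 9 for a few ratios $H/\delta$ (natural logarithm assumed; another base changes only the constant).

```python
import math


def as_bound(ratio):
    """Growth factor 1 + H/delta from Lemma 5 and Theorem 6 (constant C omitted)."""
    return 1.0 + ratio


def vertex_space_bound(ratio):
    """Growth factor (1 + log(H/delta))^2 from Lemma 9 (constant C omitted)."""
    return (1.0 + math.log(ratio)) ** 2


for r in (4, 16, 64, 256, 1024):
    print(r, as_bound(r), round(vertex_space_bound(r), 2))
```

For small ratios the squared logarithm can be the larger factor, so the advantage of the vertex space bound is asymptotic: it appears once $H/\delta$ is large, for example with small overlap.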

Let $P_\Gamma : \mathcal{N}(\Gamma) \to \mathcal{N}(\Gamma)$ denote the additive Schwarz operator defined by (10) and (11) with the bilinear form $d(\cdot, \cdot)$ replaced by the interface form $s(\cdot, \cdot)$ and the decomposition of $\mathcal{N}(\Omega)$ replaced by the decomposition of $\mathcal{N}(\Gamma)$ described above. We now prove that the condition number of $P_\Gamma$ for the nonconforming space satisfies the same bound as given in [4] for the corresponding operator for the conforming finite element space.

Theorem 10. The condition number of the additive Schwarz operator $P_\Gamma$ for Smith's decomposition for the nonconforming finite element discretization satisfies

$$\kappa(P_\Gamma) \le C\,(1+\log(H/\delta))^2. \qquad (22)$$

The constant C is independent of the mesh parameters h, H, and $\delta$.


Proof. As in the proof of Theorem 6, $P_\Gamma$ may be written as the sum of an a priori bounded number of disjoint projections, and so the largest eigenvalue of $P_\Gamma$ is bounded by a constant.

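To bound the smallest eigenvalue, we also proceed as in the proof of Theorem 6. For $p\in\mathcal{N}(\Gamma)$, set $p_G = \mathcal{I}\big((\mathcal{I}^\Gamma p)_G\big)$, $G\in\{H,E_i,V_j,F_k\}$, where $\mathcal{I}$ is interpolation at the nodes on $\Gamma$ into $\mathcal{N}(\Gamma)$ and $(\mathcal{I}^\Gamma p)_G$ is the decomposition of $\mathcal{I}^\Gamma p\in U_h(\Gamma)$ that arises in Lemma 9. Since $\mathcal{I}^\Gamma p$ and $p$ agree at the nodal degrees of freedom of $\mathcal{N}(\Gamma)$, and

$$\mathcal{N}(\Gamma^H) = \mathcal{I}(U_H(\Gamma)), \qquad \mathcal{N}(\Gamma^G) = \mathcal{I}(U_h(\Gamma^G)) \quad \forall G\in\{E_i,V_j,F_k\},$$

it is easy to check that

$$p = \sum_{G\in\{H,E_i,V_j,F_k\}} p_G.$$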

Working one subdomain at a time and using Theorem 8 and Lemma 3, we see that for $G = H$ and for $G\in\{E_i,V_j,F_k\}$ such that $\Gamma^G\cap\partial\Omega_i\ne\emptyset$ we have

$$s_i(p_G,p_G) \le C\,|(\mathcal{I}^\Gamma p)_G|^2_{1/2,\partial\Omega_i}. \qquad (23)$$

Assume that we can prove that there exists a constant C independent of h, H, and $\delta$ such that

$$\sum_{i=1}^{M} |\mathcal{I}^\Gamma p|^2_{1/2,\partial\Omega_i} \le C\sum_{i=1}^{M} s_i(p,p) \qquad \forall p\in\mathcal{N}(\Gamma). \qquad (24)$$


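Then by summing (23) over subdomains and subspaces, noting that $s_i(p_G,p_G) = 0$ if $\Gamma^G\cap\partial\Omega_i = \emptyset$, and applying Lemma 9, Equation (24), and Theorem 8, we see that

$$\sum_{G} s(p_G,p_G) = \sum_{G}\sum_{i=1}^{M} s_i(p_G,p_G) \le C\sum_{G}\sum_{i=1}^{M} |(\mathcal{I}^\Gamma p)_G|^2_{1/2,\partial\Omega_i} \le C\,(1+\log(H/\delta))^2 \sum_{i=1}^{M} |\mathcal{I}^\Gamma p|^2_{1/2,\partial\Omega_i} \le C\,(1+\log(H/\delta))^2\, s(p,p),$$

where the sums over G run over $\{H,E_i,V_j,F_k\}$.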

The proof of the condition number bound now follows from an application of Theorem 4, and we 
are only left to verify (24). 

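Define a pseudo-interpolant $\mathcal{I}^\Omega:\mathcal{N}(\Omega)\to U_h(\Omega)$ by (5), noting that the boundary of $\Omega\setminus\Gamma$ is $\partial\Omega\cup\Gamma$. Using the techniques in the proof of Theorem 2, it is easy to show that there exists a constant $C_1$, depending only on the regularity of the mesh and the degree of the nonconforming space, such that

$$\sum_{i=1}^{M} |\mathcal{I}^\Gamma p|^2_{1/2,\partial\Omega_i} \le C_1\sum_{i=1}^{M} |\mathcal{I}^\Omega p|^2_{1,\Omega_i} \qquad \forall p\in\mathcal{N}(\Omega).$$

By Lemma 7, for each $p\in\mathcal{N}(\Gamma)$ there exists an extension $p^E\in\mathcal{N}(\Omega)$ that agrees with $p$ at the nodal points on $\Gamma$ such that

$$|p^E|^2_{1,h,\Omega_i} \le C\, s_i(p,p), \qquad i = 1,\ldots,M.$$

Combining these results after another application of Lemma 7 with $\phi = \mathcal{I}^\Omega p^E$, we conclude that

$$\sum_{i=1}^{M} |\mathcal{I}^\Gamma p|^2_{1/2,\partial\Omega_i} \le C\sum_{i=1}^{M} |\mathcal{I}^\Omega p^E|^2_{1,\Omega_i} \le C\sum_{i=1}^{M} |p^E|^2_{1,h,\Omega_i} \le C\sum_{i=1}^{M} s_i(p,p),$$

which verifies (24). □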

In [6], the interface form arising from the discretization of (1) by mixed finite elements was shown to satisfy Theorem 8 with $\mathcal{N}(\Gamma)$ replaced by the appropriate space of interelement multipliers. Hence, the proof given above applies to discretization by mixed finite elements, and we arrive at the following corollary.

Corollary 11. The application of Smith's decomposition method to the dual-variable mixed finite element formulation discussed in [6] results in an operator whose condition number grows at worst like $O((1+\log(H/\delta))^2)$.


BALANCING DOMAIN DECOMPOSITION 


As the final domain decomposition method considered in this paper, we investigate the balancing 
domain decomposition method of Mandel [3] applied to nonconforming finite elements. The method 
involves the iterative solution (usually by conjugate gradients) of (16) preconditioned by the
balancing preconditioner described in Algorithm 1 below. Each iteration involves the solution of a
local problem with Dirichlet data, a local problem with Neumann data, and a “coarse-grid”
problem to propagate information globally and to ensure the consistency of the Neumann problem.
The theory and practical performance of balancing domain decomposition for the standard 
conforming Galerkin finite element method and mixed finite element method are the subjects of [5] 
and [6], respectively. As in previous sections, we will deduce the convergence theory for the 
nonconforming spaces from the conforming theory in [5] using the isomorphism introduced in the 
fifth section of this paper. 

One remarkable property of balancing domain decomposition is that the bound on the condition 
number of the preconditioned operator is independent of jumps in coefficients across subdomains. 
Specifically, let the tensor A in (1) be written as $A(x) = \alpha(x)\tilde{A}(x)$, where $\alpha$ is a positive function that is piecewise constant with constant value $\alpha_i$ on $\Omega_i$. The uniform ellipticity then implies that there exist positive constants $c_*$, $c^*$ such that

$$c_*\,\alpha_i\,\xi^T\xi \le \xi^T A(x)\,\xi \le c^*\,\alpha_i\,\xi^T\xi \qquad \forall\xi\in\mathbb{R}^n,\ \forall x\in\Omega_i. \qquad (25)$$

The bound on the condition number of the operator that arises in balancing domain decomposition will depend on $c_*$ and $c^*$ but will be independent of the $\alpha_i$ and of the ellipticity constants in (2).

Following Mandel’s original exposition in [3], we now recall the balancing preconditioner in terms 
of matrices. An equivalent variational presentation is given in [6]. By an abuse of notation, we use
the same symbol to denote an element in $\mathcal{N}(\Gamma)$ and its associated vector of values attained at the
nodal degrees of freedom.

The balancing preconditioner is parameterized by two sets of matrices, a set of weighting matrices $\{W_i\}$ and a set of kernel generators $\{Z_i\}$. The weighting matrices $W_i:\mathcal{N}(\partial\Omega_i)\to\mathcal{N}(\partial\Omega_i)$ are chosen such that they form a decomposition of unity on $\mathcal{N}(\Gamma)$, i.e.,

$$\sum_{i=1}^{M} N_i W_i N_i^T p = p \qquad \forall p\in\mathcal{N}(\Gamma),$$

where $N_i$ denotes the canonical inclusion mapping $N_i:\mathcal{N}(\partial\Omega_i)\to\mathcal{N}(\Gamma)$ obtained by extending elements of $\mathcal{N}(\partial\Omega_i)$ by zero at all other degrees of freedom. A prescription for the weighting matrices that guarantees a convergence bound independent of coefficient jumps between subdomains is given in Lemma 12 below. For each subdomain $\Omega_i$, let $n_i = \dim(\mathcal{N}(\partial\Omega_i))$, and select an $n_i\times m_i$ matrix $Z_i$ of full column rank with $0\le m_i\le n_i$, such that

$$\operatorname{Ker} S_i \subset \operatorname{Range} Z_i, \qquad i = 1,\ldots,M. \qquad (26)$$

For the scalar, second-order elliptic problems we consider in this paper, $\operatorname{Ker} S_i$ is trivial if Dirichlet data are imposed on any part of $\partial\Omega_i\cap\partial\Omega$; otherwise it is the set of functions that have the same value at all the nodes on $\partial\Omega_i$. From the kernel generators, we construct a “coarse space” $\mathcal{N}_H\subset\mathcal{N}(\Gamma)$, defined by

$$\mathcal{N}_H = \Big\{p\in\mathcal{N}(\Gamma) : p = \sum_{i=1}^{M} N_i W_i z_i,\ z_i\in\operatorname{Range} Z_i\Big\}.$$

We say that $q\in\mathcal{N}(\Gamma)$ is balanced if it is orthogonal to $\mathcal{N}_H$; that is,

$$Z_i^T W_i^T N_i^T q = 0, \qquad i = 1,\ldots,M. \qquad (27)$$
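The decomposition of unity and the balancing condition (27) can be checked on a very small example. The Python sketch below assumes a hypothetical one-dimensional chain $\Omega_1|\Omega_2|\Omega_3$ with two interface nodes and Dirichlet data at both outer ends, so only $\Omega_2$ floats and has a constant kernel generator. The diagonal weights $\alpha_i/\sum_j\alpha_j$ are a common coefficient-weighted choice assumed here for illustration, not necessarily the prescription of Lemma 12. All function names are hypothetical.

```python
import numpy as np


def inclusion(dofs, n_global):
    """N_i: extend local values on the interface part of dOmega_i by zero."""
    N = np.zeros((n_global, len(dofs)))
    N[dofs, np.arange(len(dofs))] = 1.0
    return N


def weight_matrices(sub_dofs, alphas, n_global):
    """Assumed weights: W_i = diag(alpha_i / sum of alpha_j over subdomains sharing the node)."""
    total = np.zeros(n_global)
    for dofs, a in zip(sub_dofs, alphas):
        total[dofs] += a
    return [np.diag(a / total[dofs]) for dofs, a in zip(sub_dofs, alphas)]


def balance(q, N_list, W_list, Z_list):
    """Remove from q its Euclidean projection onto N_H = span{N_i W_i Z_i}, cf. (27)."""
    cols = [N @ W @ Z for N, W, Z in zip(N_list, W_list, Z_list) if Z.shape[1] > 0]
    if not cols:
        return q.copy()
    B = np.hstack(cols)
    coef, *_ = np.linalg.lstsq(B, q, rcond=None)
    return q - B @ coef


# Omega_1 | Omega_2 | Omega_3 with interface nodes 0 and 1; only Omega_2 floats.
sub_dofs = [[0], [0, 1], [1]]
alphas = [1.0, 100.0, 1.0]
N_list = [inclusion(dofs, 2) for dofs in sub_dofs]
W_list = weight_matrices(sub_dofs, alphas, 2)
Z_list = [np.zeros((1, 0)), np.ones((2, 1)), np.zeros((1, 0))]   # Ker S_1, Ker S_3 trivial

unity = sum(N @ W @ N.T for N, W in zip(N_list, W_list))
q_bal = balance(np.array([1.0, 0.0]), N_list, W_list, Z_list)
```

Here $\sum_i N_iW_iN_i^T$ is the identity, and removing the Euclidean projection of $q = (1, 0)^T$ onto $\mathcal{N}_H$ leaves $(0.5, -0.5)^T$, which satisfies (27).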
Let $|p_i|^2_{S_i} = s_i(p_i,p_i)$. Considering those $p_i$ that are orthogonal to the range of $Z_i$, working one glob at a time, and using (36), (19), Lemma 13 and (36) in that order, we have

$$\cdots \le \frac{\alpha_j}{\alpha_i}\,C\,(1+\log(H/h))^2\,|p_i|^2_{S_i}. \qquad (37)$$

By the construction of the decomposition, there is an a priori maximum number of globs that intersect $\partial\Omega_i\cap\partial\Omega_j$. Summing over such globs, we conclude that

$$s_j(N_j^T N_i p_i,\ N_j^T N_i p_i) \le \frac{\alpha_j}{\alpha_i}\,C\,(1+\log(H/h))^2\, s_i(p_i,p_i).$$


The proof is completed by appealing to the bound in Lemma 12. □ 
SUMMARY


Recently the GMRESR inner-outer iteration scheme for the solution of linear systems of equations has 
been proposed by Van der Vorst and Vuik. Similar methods have been proposed by Axelsson and 
Vassilevski [1] and Saad (FGMRES) [10]. The outer iteration is GCR, which minimizes the residual over a 
given set of direction vectors. The inner iteration is GMRES, which at each step computes a new direction 
vector by approximately solving the residual equation. However, the optimality of the approximation over 
the space of outer search directions is ignored in the inner GMRES iteration. This leads to suboptimal 
corrections to the solution in the outer iteration, as components of the outer iteration directions may 
reenter in the inner iteration process. Therefore we propose to preserve the orthogonality relations of GCR 
in the inner GMRES iteration. This gives optimal corrections; however, it involves working with a singular, 
non-symmetric operator. We will discuss some important properties and we will show by experiments that, 
in terms of matrix vector products, this modification (almost) always leads to better convergence. However, 
because we do more orthogonalizations, it does not always give an improved performance in CPU-time. 
Furthermore, we will discuss efficient implementations as well as the truncation possibilities of the outer 
GCR process. The experimental results indicate that for such methods it is advantageous to preserve the 
orthogonality in the inner iteration. Of course we can also use other iteration schemes than GMRES as the 
inner method. Especially methods with short recurrences like BICGSTAB seem of interest. 


INTRODUCTION 


For the solution of systems of linear equations the so-called Krylov subspace methods are very popular. 
However, for general matrices no Krylov method can satisfy a global optimality requirement and have short 
recurrences [5]. Therefore either restarted or truncated versions of optimal methods, like GMRES [11], are 
used or methods with short recurrences, which do not satisfy a global optimality requirement, like BiCG 
[6], BICGSTAB [14], BICGSTAB(ℓ) [12], CGS [13] or QMR [8]. Recently Van der Vorst and Vuik proposed
a class of methods, GMRESR [15], which are nested GMRES methods; see Fig. 2. The GMRESR
algorithm is based upon the GCR algorithm [4]; see Fig. 1. For a given initial guess $x_0$, they both compute
approximate solutions $x_k$ such that $x_k - x_0\in\operatorname{span}\{u_1, u_2, \ldots, u_k\}$ and $\|r_k\|_2 = \|b - Ax_k\|_2$ is minimal.
GCR:

1. Select $x_0$, $m$, tol;
   $r_0 = b - Ax_0$, $k = 0$;
2. while $\|r_k\|_2 >$ tol do
     $k = k + 1$;
     $u_k = r_{k-1}$; $c_k = Au_k$;
     for $i = 1, \ldots, k-1$ do
       $\alpha_i = c_i^T c_k$;
       $c_k = c_k - \alpha_i c_i$;
       $u_k = u_k - \alpha_i u_i$;
     $u_k = u_k/\|c_k\|_2$;
     $c_k = c_k/\|c_k\|_2$;
     $x_k = x_{k-1} + (c_k^T r_{k-1})\,u_k$;
     $r_k = r_{k-1} - (c_k^T r_{k-1})\,c_k$;


Figure 1: The GCR algorithm
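Figure 1 translates almost line by line into NumPy. The following is a minimal dense sketch, assuming a nonsingular matrix, a hypothetical iteration cap `maxit`, and no breakdown handling; it is not the authors' implementation. Note that $u_k$ is scaled before $c_k$ is normalized, so both use the same norm.

```python
import numpy as np


def gcr(A, b, x0, tol=1e-10, maxit=200):
    """GCR as in Figure 1; keeps C_k = A U_k with orthonormal columns, cf. (1)."""
    x = np.array(x0, dtype=float)
    r = b - A @ x
    U, C = [], []
    k = 0
    while np.linalg.norm(r) > tol and k < maxit:
        k += 1
        u = r.copy()                 # GCR direction: u_k = r_{k-1}
        c = A @ u
        for ui, ci in zip(U, C):     # orthogonalize c_k on the previous c_i ...
            alpha = ci @ c
            c -= alpha * ci
            u -= alpha * ui          # ... and update u_k so that c_k = A u_k still holds
        nrm = np.linalg.norm(c)
        u /= nrm                     # scale u_k by the same norm used for c_k
        c /= nrm
        beta = c @ r
        x += beta * u
        r -= beta * c
        U.append(u)
        C.append(c)
    return x, r, k


rng = np.random.default_rng(0)
n = 40
A = 4.0 * np.eye(n) + rng.standard_normal((n, n)) / np.sqrt(n)   # nonsymmetric test matrix
b = rng.standard_normal(n)
x, r, k = gcr(A, b, np.zeros(n))
```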


GMRESR:

1. Select $x_0$, $m$, tol;
   $r_0 = b - Ax_0$, $k = 0$;
2. while $\|r_k\|_2 >$ tol do
     $k = k + 1$;
     $u_k = \mathcal{P}_{m,k}(A)\,r_{k-1}$; $c_k = Au_k$;
     for $i = 1, \ldots, k-1$ do
       $\alpha_i = c_i^T c_k$;
       $c_k = c_k - \alpha_i c_i$;
       $u_k = u_k - \alpha_i u_i$;
     $u_k = u_k/\|c_k\|_2$;
     $c_k = c_k/\|c_k\|_2$;
     $x_k = x_{k-1} + (c_k^T r_{k-1})\,u_k$;
     $r_k = r_{k-1} - (c_k^T r_{k-1})\,c_k$;

$\mathcal{P}_{m,k}(A)$ indicates the GMRES polynomial that is implicitly constructed in m steps of GMRES when solving $Ay = r_{k-1}$.

Figure 2: The GMRESR algorithm
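A corresponding sketch of Figure 2 replaces the direction $u_k = r_{k-1}$ by the result of m steps of GMRES on $Ay = r_{k-1}$ with zero initial guess. The helper `gmres_poly` and the small least-squares solve via `numpy.linalg.lstsq` are illustrative choices, assumed here for brevity rather than taken from [15].

```python
import numpy as np


def gmres_poly(A, r, m):
    """m steps of GMRES for A y = r from y = 0; returns y = P_{m,k}(A) r."""
    beta = np.linalg.norm(r)
    V = np.zeros((r.size, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = r / beta
    steps = m
    for j in range(m):
        w = A @ V[:, j]
        for i in range(j + 1):                      # modified Gram-Schmidt (Arnoldi)
            H[i, j] = V[:, i] @ w
            w -= H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] <= 1e-12 * np.linalg.norm(H[:j + 2, j]):   # Krylov space exhausted
            steps = j + 1
            break
        V[:, j + 1] = w / H[j + 1, j]
    e1 = np.zeros(steps + 1)
    e1[0] = beta
    y, *_ = np.linalg.lstsq(H[:steps + 1, :steps], e1, rcond=None)
    return V[:, :steps] @ y


def gmresr(A, b, x0, m=5, tol=1e-10, maxit=100):
    """GMRESR as in Figure 2: GCR outer loop, u_k from m inner GMRES steps."""
    x = np.array(x0, dtype=float)
    r = b - A @ x
    U, C = [], []
    k = 0
    while np.linalg.norm(r) > tol and k < maxit:
        k += 1
        u = gmres_poly(A, r, m)          # u_k = P_{m,k}(A) r_{k-1}
        c = A @ u
        for ui, ci in zip(U, C):
            alpha = ci @ c
            c -= alpha * ci
            u -= alpha * ui
        nrm = np.linalg.norm(c)
        u /= nrm
        c /= nrm
        beta = c @ r
        x += beta * u
        r -= beta * c
        U.append(u)
        C.append(c)
    return x, r, k


rng = np.random.default_rng(0)
n = 40
A = 4.0 * np.eye(n) + rng.standard_normal((n, n)) / np.sqrt(n)
b = rng.standard_normal(n)
x, r, k = gmresr(A, b, np.zeros(n), m=5)
```

The outer loop is identical to GCR; only the line that produces $u_k$ changes, which is exactly the point of comparison in the text.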


However, they compute different direction vectors $u_k$. GCR sets $u_k$ simply to $r_{k-1}$, while GMRESR computes $u_k$ by applying m steps of GMRES to $r_{k-1}$ (represented by $\mathcal{P}_{m,k}(A)r_{k-1}$ in Fig. 2). The inner GMRES iteration computes a new search direction by approximately solving the residual equation, and then the outer GCR iteration minimizes the residual over the new search direction and all previous search directions $u_i$. The algorithm can be explained as follows.

Assume we are given the system of equations $Ax = b$, where A is a real, nonsingular $(n\times n)$-matrix and b is an n-vector. Let $U_k$ and $C_k$ be two $(n\times k)$-matrices for which

$$C_k = AU_k, \qquad C_k^T C_k = I_k, \qquad (1)$$

and let $x_0$ be an initial guess. For $x_k - x_0\in\operatorname{range}(U_k)$ the minimization problem

$$\|b - Ax_k\|_2 = \min_{x\in\operatorname{range}(U_k)} \|r_0 - Ax\|_2 \qquad (2)$$

is solved by

$$x_k = x_0 + U_k C_k^T r_0 \qquad (3)$$

and $r_k = b - Ax_k$ satisfies

$$r_k = r_0 - C_k C_k^T r_0, \qquad r_k\perp\operatorname{range}(C_k). \qquad (4)$$

In fact we have constructed the inverse of the restriction of A to $\operatorname{range}(U_k)$ onto $\operatorname{range}(C_k)$. This inverse is given by

$$A^{-1}C_k C_k^T = U_k C_k^T. \qquad (5)$$

This principle underlies the GCR method. In GCR the matrices $U_k = [u_1\ u_2 \ldots u_k]$ and $C_k = [c_1\ c_2 \ldots c_k]$ are constructed such that $\operatorname{range}(U_k)$ is equal to the Krylov subspace $K_k(A; r_0) = \operatorname{span}\{r_0, Ar_0, \ldots, A^{k-1}r_0\}$. Provided GCR does not break down, i.e., if $c_k\not\perp r_{k-1}$, it is a finite method and at step k it solves the minimization problem (2).
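The relations (1) to (5) hold for any set of search directions, not only Krylov vectors. The following NumPy check (random data, chosen only for illustration) builds $U_k$ and $C_k$ from arbitrary directions with a QR factorization and compares (3) with a direct least-squares solution of (2).

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 12, 4
A = n * np.eye(n) + rng.standard_normal((n, n))   # nonsingular test matrix (assumption)
b = rng.standard_normal(n)
x0 = rng.standard_normal(n)
r0 = b - A @ x0

W = rng.standard_normal((n, k))          # arbitrary search directions
C, R = np.linalg.qr(A @ W)               # A W = C R with C^T C = I
U = W @ np.linalg.inv(R)                 # hence A U = C: relation (1)

xk = x0 + U @ (C.T @ r0)                 # (3)
rk = b - A @ xk                          # should equal r0 - C C^T r0: (4)

# reference: minimize ||r0 - A U z||_2 directly, cf. (2)
z_ls, *_ = np.linalg.lstsq(A @ U, r0, rcond=None)

lhs5 = np.linalg.solve(A, C @ C.T)       # A^{-1} C_k C_k^T
rhs5 = U @ C.T                           # U_k C_k^T: (5)
```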
Consider the k-th step in GCR. Equations (1)-(3) indicate that if in the update $u_k = r_{k-1}$ (in GCR) we replace $r_{k-1}$ by any other vector, then the algorithm still solves (2); however, the subspace $\operatorname{range}(U_k)$ will be different. The optimal choice would be $u_k = e_{k-1}$, where $e_{k-1}$ is the error in $x_{k-1}$. In order to find approximations to $e_{k-1}$, we use the relation $Ae_{k-1} = r_{k-1}$, and any method which gives an approximate solution to this equation can be used to find acceptable choices for $u_k$. In the GMRESR algorithm GMRES(m) is chosen to be the method to find such an approximation.

However, since we already have an optimal $x_{k-1}$, such that $x_{k-1} - x_0\in\operatorname{range}(U_{k-1})$, we need an approximation $u_k$ to $e_{k-1}$ such that $c_k = Au_k$ is orthogonal to $\operatorname{range}(C_{k-1})$. Such an approximation is computed explicitly by the orthogonalization loop in the outer GCR iteration. Because in GMRESR this is not taken into account in the inner GMRES iteration, a less than optimal minimization problem is solved, leading to suboptimal corrections [2] to the residual. Another disadvantage of GMRESR is that the inner iteration is essentially a restarted GMRES. It therefore also displays some of the problems of restarted GMRES. Most notably it can have the tendency to stagnate (see Numerical Experiments).

From this we infer that we should preserve the orthogonality of the correction to the residual also in the inner GMRES iteration. In order to do this we use $A_{k-1} = (I - C_{k-1}C_{k-1}^T)A$ as the operator in the inner iteration. This gives the proper corrections to the residual: $c_k\in K_m(A_{k-1}; A_{k-1}r_{k-1})$. However, the corresponding corrections to the approximate solution (contrary to ordinary implementations of Krylov methods) are found by $u_k = A^{-1}c_k\in A^{-1}K_m(A_{k-1}; A_{k-1}r_{k-1})$. These corrections can be computed since the inverse of A is known over this space. Equation (5) gives:

$$A^{-1}A_{k-1} = A^{-1}A - A^{-1}C_{k-1}C_{k-1}^T A = I - U_{k-1}C_{k-1}^T A. \qquad (6)$$

This leads to a variant of the GMRESR iteration scheme, which has improved performance for many
problems.
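Equation (6) means that the inner method only needs to apply $A_{k-1} = (I - C_{k-1}C_{k-1}^T)A$, and the matching update of the approximate solution is obtained without any solve with A. A minimal matrix-free sketch, with hypothetical helper names and random test data:

```python
import numpy as np


def inner_operator(A, U, C):
    """Return maps v -> A_k v and v -> A^{-1} A_k v, where A_k = (I - C C^T) A.

    Assumes A U = C and C^T C = I, relation (1)."""
    def apply_Ak(v):
        w = A @ v
        return w - C @ (C.T @ w)
    def solution_part(v):
        # eq. (6): A^{-1} A_k v = (I - U C^T A) v, so no solve with A is needed
        return v - U @ (C.T @ (A @ v))
    return apply_Ak, solution_part


rng = np.random.default_rng(2)
n, k = 15, 3
A = n * np.eye(n) + rng.standard_normal((n, n))
W = rng.standard_normal((n, k))
C, R = np.linalg.qr(A @ W)
U = W @ np.linalg.inv(R)                  # A U = C, C^T C = I
apply_Ak, solution_part = inner_operator(A, U, C)
v = rng.standard_normal(n)
c = apply_Ak(v)                           # residual correction, orthogonal to range(C)
u = solution_part(v)                      # matching solution correction
```

In a real implementation the product $Av$ would be shared between the two maps; it is recomputed here only to keep the helpers independent.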

In this article we will consider GMRES and BICGSTAB as inner methods. In the next section we will 
discuss the implications of the orthogonalization in the inner method. It will be proved that this leads to 
an optimal approximation over the space spanned by both the outer and the inner iteration vectors. It also 
introduces a potential problem: the possibility of breakdown in the generation of the Krylov space in the 
inner iteration, since we iterate with a singular operator. We will show, however, that such a breakdown 
can never happen before a specific (generally large) number of iterations. Furthermore, we will also show 
how to remedy such a breakdown. We will also discuss the efficient implementation of these methods and 
how we can truncate the outer GCR iteration. Outlines of the algorithms can be found in [7], [2]. 


CONSEQUENCES OF INNER ORTHOGONALIZATION 


To keep this section concise, we will only give a short indication of the proofs or omit them completely. 

The proofs can be found in [2]. Throughout the rest of this article we will use the following notations: 
• By $U_k = [u_1 \ldots u_k]$ and $C_k = [c_1 \ldots c_k]$ we denote matrices that satisfy the relations (1);
• By $x_k$ and $r_k$ we denote the vectors that satisfy the relations (2)-(4);
• By $P_k$ and $Q_k$ we denote the projections defined as $P_k = C_kC_k^T$ and $Q_k = U_kC_k^TA$;
• By $A_k$ we denote the operator defined as $A_k = (I - P_k)A$;
• By $V_m = [v_1, \ldots, v_m]$ we denote the orthonormal matrix generated by m steps of Arnoldi (GMRES) with $A_k$ and such that $v_1 = r_k/\|r_k\|_2$.

From this and (6) it then follows that

$$AQ_k = P_kA, \qquad A^{-1}A_k = I - Q_k. \qquad (7)$$
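The identities in (7) are easy to confirm numerically. This sketch (random data, assumed only for illustration) forms $P_k$, $Q_k$ and $A_k$ explicitly; dense matrices are used purely for checking, since the methods themselves never form them.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 10, 3
A = n * np.eye(n) + rng.standard_normal((n, n))
W = rng.standard_normal((n, k))
C, R = np.linalg.qr(A @ W)
U = W @ np.linalg.inv(R)              # A U_k = C_k, C_k^T C_k = I

P = C @ C.T                           # P_k = C_k C_k^T (orthogonal projection)
Q = U @ C.T @ A                       # Q_k = U_k C_k^T A (oblique projection)
Ak = (np.eye(n) - P) @ A              # A_k = (I - P_k) A
```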


We will describe the (k+1)-th step of our variant of the GMRESR iteration scheme, where in the inner GMRES iteration the modified operator $A_k$ is used. We use m (not fixed) steps of the GMRES algorithm to compute the correction to $r_k$ in the space $K_m(A_k; A_kr_k)$. This leads to the optimal correction to the approximate solution $x_{k+1}$ over the ‘global’ space $\operatorname{range}(U_k)\oplus A^{-1}K_m(A_k; A_kr_k)$.


Theorem 1. The Arnoldi process in the inner GMRES iteration defines the relation $A_kV_m = V_{m+1}\bar{H}_m$, with $\bar{H}_m$ an $((m+1)\times m)$ Hessenberg matrix. Let y be defined by

$$y:\ \min_{y\in\mathbb{R}^m}\|r_k - A_kV_my\|_2 = \min_{y\in\mathbb{R}^m}\|r_k - V_{m+1}\bar{H}_my\|_2. \qquad (8)$$

Then the minimal residual solution of the inner GMRES iteration, $(A^{-1}A_kV_my)$, gives the outer approximation

$$x_{k+1} = x_k + (I - Q_k)V_my, \qquad (9)$$

which is also the solution to the ‘global’ minimization problem

$$x_{k+1}:\ \min_{x - x_0\,\in\,\operatorname{range}(U_k)\oplus\operatorname{range}(V_m)} \|b - Ax\|_2. \qquad (10)$$


It also follows from this theorem that the GCR optimization (in the outer iteration) is given by (9), so that the residual computed in the inner GMRES iteration equals the residual of the outer GCR iteration: $r_{k+1} = b - Ax_{k+1} = b - Ax_k - A_kV_my = r_k - A_kV_my$. From this it follows that in the outer GCR iteration the vectors $u_{k+1}$ and $c_{k+1}$ are given by

$$c_{k+1} = (A_kV_my)/\|A_kV_my\|_2, \qquad (11)$$

$$u_{k+1} = ((I - Q_k)V_my)/\|A_kV_my\|_2. \qquad (12)$$

Note that $(I - Q_k)V_my$ has already been computed as the approximate solution in the inner GMRES iteration; see (9). Also, $A_kV_my$ is easily computed from the relation $A_kV_my = V_{m+1}\bar{H}_my$. Moreover, as a result of using GMRES in the inner iteration, the norm of the residual $r_{k+1}$ as well as the norm of $A_kV_my$ are already known at no extra computational cost. Consequently, the outer GCR iteration becomes very
simple.
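Theorem 1 can be checked directly: one outer step with inner GMRES on $A_k$ should reproduce the minimizer of (10) over $\operatorname{range}(U_k)\oplus\operatorname{range}(V_m)$. The sketch below assumes dense NumPy arrays, an optimal starting iterate, and no breakdown in the m Arnoldi steps. `gcro_step` is a hypothetical name, and one extra product with A is spent on $(I - Q_k)V_my$ for readability.

```python
import numpy as np


def gcro_step(A, b, x, U, C, m):
    """One outer step of the GCR variant with inner GMRES on A_k = (I - C C^T) A.

    Assumes A U = C, C^T C = I, x optimal over x_0 + range(U) (so r is orthogonal
    to range(C)), and no breakdown during the m inner Arnoldi steps."""
    r = b - A @ x
    beta = np.linalg.norm(r)
    V = np.zeros((r.size, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = r / beta
    for j in range(m):
        w = A @ V[:, j]
        w -= C @ (C.T @ w)                     # apply A_k: orthogonalize on C first
        for i in range(j + 1):                 # then ordinary Arnoldi on V
            H[i, j] = V[:, i] @ w
            w -= H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        V[:, j + 1] = w / H[j + 1, j]
    e1 = np.zeros(m + 1)
    e1[0] = beta
    y, *_ = np.linalg.lstsq(H, e1, rcond=None)   # small least-squares problem (8)
    Vy = V[:, :m] @ y
    sol = Vy - U @ (C.T @ (A @ Vy))              # (I - Q_k) V_m y
    AkVy = V @ (H @ y)                           # A_k V_m y = V_{m+1} H_m y
    nrm = np.linalg.norm(AkVy)
    x_new = x + sol                              # (9)
    c_new = AkVy / nrm                           # (11)
    u_new = sol / nrm                            # (12)
    return x_new, np.column_stack([U, u_new]), np.column_stack([C, c_new]), V[:, :m]


rng = np.random.default_rng(4)
n, k, m = 20, 2, 4
A = n * np.eye(n) + rng.standard_normal((n, n))
b = rng.standard_normal(n)
x0 = np.zeros(n)
r0 = b - A @ x0
W = rng.standard_normal((n, k))
C, R = np.linalg.qr(A @ W)
U = W @ np.linalg.inv(R)                 # A U_k = C_k, C_k^T C_k = I
xk = x0 + U @ (C.T @ r0)                 # optimal over range(U_k), cf. (3)

x_new, U_new, C_new, Vm = gcro_step(A, b, xk, U, C, m)
S = np.hstack([U, Vm])                   # 'global' search space of (10)
z, *_ = np.linalg.lstsq(A @ S, r0, rcond=None)
x_global = x0 + S @ z
```

The test compares `x_new` with a brute-force least-squares minimizer over the same space, which is what (10) asserts.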

We will now consider the possibility of breakdown when generating a Krylov space with a singular, nonsymmetric operator. Although GMRES is still optimal in the sense that at each iteration it delivers the minimum residual solution over the generated Krylov subspace, the generation of the Krylov subspace itself, from a singular operator, may terminate too early. The following simple example shows that this may happen before the solution is found, even when the solution and the right-hand side are both in the range of the given (singular) operator and in the orthogonal complement of its null space.

Define the matrix $A = (e_2\ e_3\ e_4\ 0)$, where $e_i$ denotes the i-th Cartesian basis vector. Note that $A = (I - e_1e_1^T)(e_2\ e_3\ e_4\ e_1)$, which is the same type of operator as $A_k$: an orthogonal projection times a nonsingular operator. Now consider the system of equations $Ax = e_3$, whose solution is $x = e_2$. Then GMRES (or any other Krylov method) will search for a solution in the space

$$\operatorname{span}\{e_3, Ae_3, A^2e_3, \ldots\} = \operatorname{span}\{e_3, e_4, 0, 0, \ldots\}.$$

So we have a breakdown of the Krylov space, and the solution is not contained in it. We remark that the singular unsymmetric case is quite different from the symmetric one.
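The breakdown example can be verified in a few lines of NumPy; the script only reproduces the matrices defined in the text.

```python
import numpy as np

e = np.eye(4)                                   # e[:, i] is e_{i+1} in the text's numbering
A = np.column_stack([e[:, 1], e[:, 2], e[:, 3], np.zeros(4)])      # A = (e2 e3 e4 0)
perm = np.column_stack([e[:, 1], e[:, 2], e[:, 3], e[:, 0]])       # (e2 e3 e4 e1), nonsingular
proj = np.eye(4) - np.outer(e[:, 0], e[:, 0])                      # I - e1 e1^T
rhs = e[:, 2]                                   # right-hand side e3
x_true = e[:, 1]                                # A e2 = e3
krylov = np.column_stack([np.linalg.matrix_power(A, j) @ rhs for j in range(4)])
coef, *_ = np.linalg.lstsq(krylov, x_true, rcond=None)
gap = np.linalg.norm(krylov @ coef - x_true)    # distance from x_true to the Krylov space
```

The Krylov matrix has rank 2, and the solution $e_2$ lies at distance 1 from the Krylov space.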
In the remainder of this section we will prove that a breakdown in the inner GMRES method cannot occur before the total number of iterations exceeds the dimension of the Krylov space $K(A; r_0)$. This means that, in practice, a breakdown will be rare. Furthermore, we will show how such a breakdown can be overcome.

We will now define breakdown of the Krylov space for the inner GMRES iteration more formally.

Definition 1. We say there is a breakdown of the Krylov subspace in the inner GMRES iteration if $A_kv_m\in\operatorname{range}(V_m)$, since this implies we can no longer expand the Krylov subspace. We call it a lucky breakdown if $v_1\in\operatorname{range}(A_kV_m)$, because we then have found the solution (the inverse of A is known over the space $\operatorname{range}(A_kV_m)$). We call it a true breakdown if $v_1\notin\operatorname{range}(A_kV_m)$, because then the solution is not contained in the Krylov subspace.


The following theorem relates true breakdown to the invariance of the sequence of subspaces in the inner 
method for the operator A k . Part four indicates that it is always known whether a breakdown is true or 
lucky. 


Theorem 2. The following statements are equivalent:

1. A true breakdown occurs in the inner GMRES iteration at step m;

2. $\operatorname{range}(A_kV_{m-1})$ is an invariant subspace of $A_k$;

3. $A_kv_m\in\operatorname{range}(A_kV_{m-1})$;

4. $A_kV_m = V_mH_m$, and $H_m$ is a singular $m\times m$ matrix.

From Theorem 2, one can already conclude that a true breakdown occurs if and only if $A_k$ is singular over $K_m(A_k; r_k)$. From the definition of $A_k$ we know $\operatorname{null}(A_k) = \operatorname{range}(U_k)$. We will make this more explicit in the following theorem, which relates true breakdown to the intersection of the inner search space and the
outer search space.
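Statement 4 of Theorem 2 gives a cheap test once the Arnoldi process stops: a true breakdown corresponds to a singular square Hessenberg matrix $H_m$. The sketch below (a hypothetical helper, with tolerances chosen for illustration) applies that test to the singular example above, which has the same structure as $A_k$, and to a nonsingular diagonal matrix where the breakdown is lucky.

```python
import numpy as np


def arnoldi_breakdown(op, v, maxit, tol=1e-12):
    """Run Arnoldi with operator `op` from v and classify a breakdown.

    A breakdown means op(v_m) lies in range(V_m); following Theorem 2(4), it is
    'true' if the square Hessenberg matrix H_m is singular and 'lucky' otherwise."""
    V = [v / np.linalg.norm(v)]
    H = np.zeros((maxit + 1, maxit))
    for j in range(maxit):
        w = op(V[j])
        for i in range(j + 1):
            H[i, j] = V[i] @ w
            w = w - H[i, j] * V[i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < tol:                       # Krylov space can no longer grow
            m = j + 1
            singular = np.linalg.matrix_rank(H[:m, :m], tol=1e-10) < m
            return m, ("true" if singular else "lucky")
        V.append(w / H[j + 1, j])
    return maxit, "none"


e = np.eye(4)
A = np.column_stack([e[:, 1], e[:, 2], e[:, 3], np.zeros(4)])   # example matrix from the text
D = np.diag([1.0, 2.0, 3.0, 4.0])                                  # nonsingular comparison case
true_case = arnoldi_breakdown(lambda v: A @ v, e[:, 2], maxit=4)
lucky_case = arnoldi_breakdown(lambda v: D @ v, e[:, 0] + e[:, 1], maxit=4)
```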

Theorem 3. A true breakdown occurs if and only if

$$\operatorname{range}(V_m)\cap\operatorname{range}(U_k) \ne \{0\}.$$


The following theorem indicates that no true breakdown in the inner GMRES iteration can occur before the total number of iterations exceeds the dimension of the Krylov space $K(A; r_0)$.

Theorem 4. Let $m = \dim(K(A; r_0))$ and let l be such that $r_k = \mathcal{P}_l(A)r_0$ for some polynomial $\mathcal{P}_l$ of degree l. Then

$$\dim(K_{j+1}(A_k; r_k)) = j + 1 \qquad \text{for } j + l < m,$$

and therefore no true breakdown occurs in the first j steps of the inner GMRES iteration.


We will now show how a true breakdown can be overcome. There are basically two ways to continue:

• In the inner iteration: by finding a suitable vector to expand the Krylov space.

• In the outer iteration: by computing the solution of the inner iteration just before the true breakdown and then making one LSQR-step (see below) in the outer iteration.

We will consider the continuation in the inner GMRES iteration first. The following theorem indicates how one can continue the generation of the Krylov space $K(A_k; r_k)$ if in the inner GMRES iteration a true breakdown occurs.

Theorem 5. If a true breakdown occurs in the inner GMRES iteration, then

$$\exists\, c\in\operatorname{range}(C_k):\ A_kc\notin\operatorname{range}(A_kV_{m-1}). \qquad (13)$$

This implies that one can try the vectors $c_i$ until one of them works. However, one should realize that the minimization problem (8) then becomes slightly more complicated.

Another way to continue after a true breakdown in the inner GMRES iteration is to compute the inner 
iteration solution just before the breakdown and then apply an LSQR-switch (see below) in the outer GCR 
iteration. The following theorem states the reason why one has to apply an LSQR-switch. 

Theorem 6. Suppose one computes the solution of the inner GMRES iteration just before a true breakdown. Then stagnation will occur in the next inner iteration, that is, $r_{k+1}\perp K(A_{k+1}; A_{k+1}r_{k+1})$. This will lead to a breakdown of the outer GCR iteration.

The reason for this stagnation in the inner GMRES iteration is that the new residual $r_{k+1}$ remains in the same Krylov space $K(A_k; r_k)$, which contains a vector $u\in\operatorname{range}(U_k)$. So we have to ‘leave’ this Krylov space. We can do this using the so-called LSQR-switch, which was introduced in [15] to remedy stagnation in the inner GMRES iteration. Just as in the GMRESR method, stagnation in the inner GMRES iteration will result in a breakdown in the outer GCR iteration, because the residual cannot be updated. The following theorem states that this LSQR-switch actually works.

Theorem 7. If stagnation occurs in the inner GMRES iteration, that is, if $\min_{y\in\mathbb{R}^m}\|r_{k+1} - A_{k+1}V_my\|_2 = \|r_{k+1}\|_2$, then one can continue by setting (LSQR-switch)

$$c_{k+2} = \gamma A_{k+1}A^Tr_{k+1} \quad\text{and} \qquad (14)$$

$$u_{k+2} = \gamma(I - Q_{k+1})A^Tr_{k+1}, \qquad (15)$$

where $\gamma = \|A_{k+1}A^Tr_{k+1}\|_2^{-1}$. This leads to

$$r_{k+2} = r_{k+1} - (r_{k+1}^Tc_{k+2})\,c_{k+2} \quad\text{and} \qquad (16)$$

$$x_{k+2} = x_{k+1} + (r_{k+1}^Tc_{k+2})\,u_{k+2}, \qquad (17)$$

which always gives an improved approximation. Therefore, these vectors can be used as the start vectors for
a new inner GMRES iteration.
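A dense sketch of the LSQR-switch (14) to (17), assuming relations (1) hold for $U_{k+1}$ and $C_{k+1}$ and that $r_{k+1}$ is orthogonal to $\operatorname{range}(C_{k+1})$. Under these assumptions $r_{k+1}^Tc_{k+2} = \gamma\|A^Tr_{k+1}\|_2^2$, so the step is positive whenever $r_{k+1}\ne 0$, which is why the switch always improves the approximation. Function and variable names are hypothetical.

```python
import numpy as np


def lsqr_switch(A, x, r, U, C):
    """LSQR-switch of Theorem 7, eqs. (14)-(17).

    Assumes A U = C, C^T C = I and r orthogonal to range(C); U, C stand for
    U_{k+1}, C_{k+1}, and x, r for x_{k+1}, r_{k+1}."""
    s = A.T @ r
    As = A @ s
    c = As - C @ (C.T @ As)           # A_{k+1} A^T r_{k+1}
    u = s - U @ (C.T @ As)            # (I - Q_{k+1}) A^T r_{k+1}
    gamma = 1.0 / np.linalg.norm(c)   # normalizes c_{k+2}
    c, u = gamma * c, gamma * u       # (14), (15)
    beta = r @ c                      # equals gamma * ||A^T r||^2 > 0 when r != 0
    return x + beta * u, r - beta * c, u, c   # (17), (16)


rng = np.random.default_rng(6)
n, k = 12, 3
A = n * np.eye(n) + rng.standard_normal((n, n))
b = rng.standard_normal(n)
W = rng.standard_normal((n, k))
C, R = np.linalg.qr(A @ W)
U = W @ np.linalg.inv(R)
x = U @ (C.T @ b)                     # optimal iterate over range(U), x_0 = 0
r = b - A @ x                         # orthogonal to range(C)
x2, r2, u, c = lsqr_switch(A, x, r, U, C)
```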


IMPLEMENTATION 


We will now describe how to implement these methods efficiently (see also [2], [7]). First we will discuss the
outer GCR iteration and then the inner GMRES iteration. The implementation of a method like
BICGSTAB in the inner iteration will then be obvious. Instead of the matrices $U_k$ and $C_k$ we will use in
the actual implementation the matrices $\hat{U}_k$, $\hat{C}_k$, $N_k$, $Z_k$ and the vector $d_k$, which are defined below.


Definition 2. The matrices $\hat{U}_k$, $\hat{C}_k$, $N_k$, $Z_k$ and the vector $d_k$ are defined as follows:

$$C_k = \hat{C}_kN_k, \quad\text{where} \qquad (18)$$

$$N_k = \operatorname{diag}(\|\hat{c}_1\|_2^{-1}, \|\hat{c}_2\|_2^{-1}, \ldots, \|\hat{c}_k\|_2^{-1}), \qquad (19)$$

$$A\hat{U}_k = \hat{C}_kZ_k, \qquad (20)$$

where $Z_k$ is assumed to be upper triangular. Finally, $d_k$ is defined by the relation

$$r_k = r_0 - \hat{C}_kd_k. \qquad (21)$$

From this, the approximate solution $x_k$ corresponding to $r_k$ is implicitly represented as

$$x_k = x_0 + \hat{U}_kZ_k^{-1}d_k. \qquad (22)$$

Using this relation, $x_k$ can be computed at the end of the complete iteration or before truncation (see next section). The implicit representation of $U_k$ avoids all the intermediate updates of a new $u_{k+1}$ with the previous $u_i$, which account for approximately 50% of the computational cost in the outer GCR iteration (see (11) and (12)).

GMRES as inner iteration. After k outer GCR iterations we have $\hat{U}_k$, $\hat{C}_k$, $N_k$, $Z_k$, $d_k$ and $r_k$. Then, in the inner GMRES iteration, the orthogonal matrix $V_{m+1}$ is constructed such that $\hat{C}_k^TV_{m+1} = O$ and

$$AV_m = \hat{C}_kB_m + V_{m+1}\bar{H}_m, \qquad (23)$$

$$B_m = N_k^2\,\hat{C}_k^TAV_m. \qquad (24)$$

This algorithm is equivalent to the usual GMRES algorithm, except that the vectors $Av_i$ are first orthogonalized on $\hat{C}_k$. From (23) and (24) it is obvious that $AV_m - \hat{C}_kB_m = A_kV_m = V_{m+1}\bar{H}_m$ (cf. Theorem 1). Next we compute y according to (8) and we set (cf. (11) without normalization):

$$\hat{c}_{k+1} = V_{m+1}\bar{H}_my, \qquad (25)$$

$$\hat{u}_{k+1} = V_my. \qquad (26)$$

This leads to $A\hat{u}_{k+1} = AV_my = \hat{C}_kB_my + V_{m+1}\bar{H}_my = \hat{C}_kB_my + \hat{c}_{k+1}$, so that if we set $z_{k+1} = ((B_my)^T\ 1)^T$ the relation $A\hat{U}_{k+1} = \hat{C}_{k+1}Z_{k+1}$ is again satisfied. It follows from Theorem 1 that the new residual of the outer GCR iteration is equal to the final residual of the inner iteration, $r_{k+1} = r^{\mathrm{inner}}_m$, and is given by $r_{k+1} = r_k - \hat{c}_{k+1}$, so that the new component of $d_{k+1}$ equals 1. Obviously the residual norm only needs to be computed once. If we replace, in the formula above, the new residual of the outer GCR iteration $r_{k+1}$ by the residual of the inner GMRES iteration $r^{\mathrm{inner}}$, we see an important relation that holds more generally: $\hat{c}_{k+1} = r_k - r^{\mathrm{inner}}$. This relation is important, since in general (when other Krylov methods are used for the inner iteration) $\hat{c}_{k+1}$ or $c_{k+1}$ cannot be computed from $\hat{u}_{k+1}$, because $A\hat{u}_{k+1}$ is not always computed explicitly, nor does a relation like (25) always exist. Finally, we need to compute the new coefficient of $N_k$, $\|\hat{c}_{k+1}\|_2^{-1}$, in order to satisfy the relations in Definition 2.
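The bookkeeping of Definition 2 and (23) to (26) can be sketched as follows. This is a minimal illustration under assumptions (dense NumPy arrays, a fixed number m of inner steps, a fixed number of outer steps, no breakdown handling, and no truncation); `gcro_implicit` is a hypothetical name. Only $\hat{U}_k$, $\hat{C}_k$, $Z_k$ and $d_k$ are updated, and $x_k$ is formed once at the end from (22).

```python
import numpy as np


def gcro_implicit(A, b, x0, m=4, outer=3):
    """Outer GCR steps with inner GMRES on A_k, storing only Uhat, Chat, Z and d (Definition 2)."""
    n = b.size
    r = b - A @ x0
    Chat = np.zeros((n, 0))
    Uhat = np.zeros((n, 0))
    Z = np.zeros((0, 0))
    d = np.zeros(0)
    for _ in range(outer):
        n2 = 1.0 / np.sum(Chat**2, axis=0)           # diagonal of N_k^2, cf. (19)
        beta = np.linalg.norm(r)
        V = np.zeros((n, m + 1))
        H = np.zeros((m + 1, m))
        B = np.zeros((Chat.shape[1], m))
        V[:, 0] = r / beta                            # r_k is orthogonal to range(Chat)
        for j in range(m):
            w = A @ V[:, j]
            B[:, j] = n2 * (Chat.T @ w)               # column j of B_m, eq. (24)
            w -= Chat @ B[:, j]                       # orthogonalize A v_j on Chat first
            for i in range(j + 1):
                H[i, j] = V[:, i] @ w
                w -= H[i, j] * V[:, i]
            H[j + 1, j] = np.linalg.norm(w)
            V[:, j + 1] = w / H[j + 1, j]             # assumes no breakdown
        e1 = np.zeros(m + 1)
        e1[0] = beta
        y, *_ = np.linalg.lstsq(H, e1, rcond=None)    # (8)
        c_new = V @ (H @ y)                           # (25)
        u_new = V[:, :m] @ y                          # (26)
        k = Chat.shape[1]
        Z_next = np.zeros((k + 1, k + 1))
        Z_next[:k, :k] = Z
        Z_next[:k, k] = B @ y                         # new column z_{k+1} = ((B_m y)^T, 1)^T
        Z_next[k, k] = 1.0
        Z = Z_next
        Chat = np.column_stack([Chat, c_new])
        Uhat = np.column_stack([Uhat, u_new])
        d = np.append(d, 1.0)                         # r_{k+1} = r_k - c_new
        r = r - c_new
    x = x0 + Uhat @ np.linalg.solve(Z, d)             # (22), formed only once
    return x, r, Uhat, Chat, Z


rng = np.random.default_rng(7)
n = 30
A = 4.0 * np.eye(n) + rng.standard_normal((n, n)) / np.sqrt(n)
b = rng.standard_normal(n)
x, r, Uhat, Chat, Z = gcro_implicit(A, b, np.zeros(n))
G = Chat.T @ Chat
```

The checks confirm that (20) is maintained and that the recursively updated residual agrees with $b - Ax_k$ for the implicitly represented $x_k$.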


TRUNCATION 


In practice, since memory space may be limited and since the method becomes increasingly expensive for large k (the number of outer search vectors), we want to truncate the set of outer iteration vectors $(\hat{u}_i)$ and $(\hat{c}_i)$ at $k = k_{\max}$, where $k_{\max}$ is some positive integer. Basically, there are two ways to do this: one can discard one or more iteration vectors (dropping), or one can assemble two or more iteration vectors into one single iteration vector (assembly). We will first discuss the strategy for truncation and then its implementation.


A strategy for truncation. In each outer GCR iteration step the matrices $\hat{U}_k$ and $\hat{C}_k$ are augmented with one extra column. To keep the memory requirement constant, at step $k = k_{\max}$ it is therefore sufficient to diminish the matrices $\hat{U}_{k_{\max}}$ and $\hat{C}_{k_{\max}}$ by one column. From (22) we have $x_k = x_0 + \hat{U}_kZ_k^{-1}d_k$. Denote $\xi_k = Z_k^{-1}d_k$ and consider the sequence of vectors $(\xi_k)$. The components $\xi_k^{(i)}$ of these vectors are the coefficients of the updates $\hat{u}_i$ of the approximate solution $x_k$. These coefficients converge to limits as k increases. Moreover, $(\xi_k^{(1)})$ converges faster than $(\xi_k^{(2)})$, and $(\xi_k^{(2)})$ converges faster than $(\xi_k^{(3)})$, etc. Suppose that the sequence $(\xi_k^{(1)})$ has converged to within machine precision. From then on it makes no difference for the computation of $x_k$ when we perform the update $x_0 + \xi_k^{(1)}\hat{u}_1$. In terms of direction vectors this means that the outer direction vector $\hat{u}_1$ will not reenter as a component in the inner iteration process. Therefore one might hope that discarding the vector $\hat{c}_1$ will not spoil the convergence. This leads to the idea of dropping the vector $\hat{c}_1$ $(= A\hat{u}_1)$ or of assembling $\hat{c}_1$ with $\hat{c}_2$ into $\hat{c}$ (say) when

$$\delta(k) = \frac{|\xi_{k-1}^{(1)} - \xi_k^{(1)}|}{|\xi_k^{(1)}|} < \varepsilon, \qquad (27)$$

where $\varepsilon > 0$ is a small constant. The optimal $\varepsilon$, which may depend on k, can be determined from experiments. When $\delta(k) > \varepsilon$ we drop $\hat{c}_{k_{\max}}$ or we assemble $\hat{c}_{k_{\max}-1}$ and $\hat{c}_{k_{\max}}$ (of course other choices are feasible as well, but we will not consider them in this article). With this strategy we hope to avoid stagnation by keeping the most relevant part of the subspace $\operatorname{range}(\hat{C}_k)$ in store as a subspace of dimension $k - 1$. In the next subsections we describe how to implement this strategy and its consequences
for the matrices $\hat{C}_k$ and $\hat{U}_k$.

Dropping a vector. Let $1\le j\le k = k_{\max}$. Dropping the column $\hat{c}_j$ is easy: we can discard it without consequences. So let $\hat{C}'_{k-1}$ be the matrix $\hat{C}_k$ without the column $\hat{c}_j$. Dropping a column from $\hat{U}_k$ needs more work, since $x_k$ is computed as $x_k = x_0 + \hat{U}_kZ_k^{-1}d_k$. Moreover, in order to be able to apply the same strategy in the next outer iteration, we have to be able to compute $x_{k+1}$ in a similar way. For that purpose, assume that $x_k$ can be computed as

$$x_k = x'_0 + \hat{U}'_{k-1}(Z'_{k-1})^{-1}d'_{k-1}, \qquad (28)$$

where $\hat{U}'_{k-1}$ and $Z'_{k-1}$ are matrices such that $A\hat{U}'_{k-1} = \hat{C}'_{k-1}Z'_{k-1}$ (see (20)). These matrices $\hat{U}'_{k-1}$ and $Z'_{k-1}$ are easily computed by using the j-th row of (20) to eliminate the j-th column of $\hat{C}_k$ in (20). In order to determine $x'_0$ and $d'_{k-1}$ we introduce the matrix $\tilde{U}_k = A^{-1}\hat{C}_k = \hat{U}_kZ_k^{-1}$. This enables us to write

$$x_k = \big(x_0 + d_k^{(j)}\tilde{u}_j\big) + \sum_{i=1}^{j-1} d_k^{(i)}\tilde{u}_i + \sum_{i=j+1}^{k} d_k^{(i)}\tilde{u}_i \quad\text{and}\quad \tilde{u}_j = \Big(\hat{u}_j - \sum_{i=1}^{j-1} z_{ij}\tilde{u}_i\Big)\Big/z_{jj}. \qquad (29)$$

Substituting the equation for $\tilde{u}_j$ into the equation for $x_k$, we can compute $x_k$ from

$$x_k = \Big(x_0 + \frac{d_k^{(j)}}{z_{jj}}\hat{u}_j\Big) + \sum_{i=1}^{j-1}\Big(d_k^{(i)} - d_k^{(j)}\frac{z_{ij}}{z_{jj}}\Big)\tilde{u}_i + \sum_{i=j+1}^{k} d_k^{(i)}\tilde{u}_i. \qquad (30)$$

Notice that this equation precisely defines $x'_0$ and $d'_{k-1}$:

$$x'_0 = x_0 + \frac{d_k^{(j)}}{z_{jj}}\hat{u}_j,$$

$$d'^{(i)}_{k-1} = d_k^{(i)} - d_k^{(j)}\frac{z_{ij}}{z_{jj}} \quad\text{for } i = 1,\ldots,j-1, \quad\text{and} \qquad (31)$$

$$d'^{(i)}_{k-1} = d_k^{(i+1)} \quad\text{for } i = j,\ldots,k-1.$$

Now we have deallocated two vectors and we compute $x_k$ as in (28). We can continue the algorithm.
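Dropping a column by (28) to (31) can be checked numerically: after the update, $x'_0 + \hat{U}'_{k-1}(Z'_{k-1})^{-1}d'_{k-1}$ must reproduce the old $x_k$, and (20) must still hold. The sketch below uses random data satisfying (20) and 0-based indices; `drop_column` is a hypothetical name.

```python
import numpy as np


def drop_column(x0, Uhat, Chat, Z, d, j):
    """Remove column j (0-based) as in (28)-(31).

    Returns x0', Uhat', Chat', Z', d' with A Uhat' = Chat' Z' and
    x0' + Uhat' Z'^{-1} d' equal to the old x_k = x0 + Uhat Z^{-1} d."""
    k = Z.shape[0]
    zjj = Z[j, j]
    x0_new = x0 + (d[j] / zjj) * Uhat[:, j]            # x'_0 in (31)
    d_new = d.copy()
    d_new[:j] -= d[j] * Z[:j, j] / zjj                  # d'^{(i)}, i < j, in (31)
    d_new = np.delete(d_new, j)                         # d'^{(i)} = d^{(i+1)}, i >= j
    U_new = Uhat.copy()
    Z_new = Z.copy()
    for m in range(j + 1, k):                           # eliminate c_j from A u_m
        f = Z[j, m] / zjj
        U_new[:, m] -= f * Uhat[:, j]
        Z_new[:j, m] -= f * Z[:j, j]
    U_new = np.delete(U_new, j, axis=1)
    Z_new = np.delete(np.delete(Z_new, j, axis=0), j, axis=1)
    C_new = np.delete(Chat, j, axis=1)                  # c_j is simply discarded
    return x0_new, U_new, C_new, Z_new, d_new


rng = np.random.default_rng(5)
n, k = 15, 5
A = n * np.eye(n) + rng.standard_normal((n, n))
Uhat = rng.standard_normal((n, k))
Z = 3.0 * np.eye(k) + np.triu(rng.standard_normal((k, k)), 1)   # upper triangular, nonsingular
Chat = A @ Uhat @ np.linalg.inv(Z)                              # enforce (20): A Uhat = Chat Z
d = 1.0 + rng.random(k)
x0 = rng.standard_normal(n)
x_old = x0 + Uhat @ np.linalg.solve(Z, d)                        # (22)
dropped = [drop_column(x0, Uhat, Chat, Z, d, j) for j in range(k)]
```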

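Assembly of two vectors. Let $1\le j < l\le k = k_{\max}$. Again, assembling $\hat{c}_j$ and $\hat{c}_l$ is easy. Let $\hat{c} = d_k^{(j)}\hat{c}_j + d_k^{(l)}\hat{c}_l$ overwrite the l-th column of $\hat{C}_k$. Then, let $\hat{C}'_{k-1}$ be this new matrix $\hat{C}_k$ without the j-th column. Analogous to the above, we wish to compute $x_k$ as in (28). For the purpose of determining the matrices $\hat{U}'_{k-1}$ and $Z'_{k-1}$, let $\tilde{u} = d_k^{(j)}\tilde{u}_j + d_k^{(l)}\tilde{u}_l$ (so that $A\tilde{u} = \hat{c}$) and compute $t_j^{(m)}$ and $t_l^{(m)}$ such that

$$z_{jm}\tilde{u}_j + z_{lm}\tilde{u}_l + t_j^{(m)}\tilde{u}_j = t_l^{(m)}\tilde{u},$$

which gives $t_j^{(m)} = z_{lm}\,d_k^{(j)}/d_k^{(l)} - z_{jm}$ and $t_l^{(m)} = z_{lm}/d_k^{(l)}$. This enables us to write $\hat{u}_m = \sum_{i=1}^{m} z_{im}\tilde{u}_i$ for $m = 1,\ldots,j-1$ and

$$\hat{u}_m = \sum_{\substack{i=1\\ i\ne j,l}}^{m} z_{im}\tilde{u}_i - t_j^{(m)}\tilde{u}_j + t_l^{(m)}\tilde{u}, \qquad m = j,\ldots,k. \qquad (32)$$

Substituting $\tilde{u}_j = \big(\hat{u}_j - \sum_{i=1}^{j-1} z_{ij}\tilde{u}_i\big)/z_{jj}$ to eliminate $\tilde{u}_j$ from (32), we get $\hat{u}_m = \sum_{i=1}^{m} z_{im}\tilde{u}_i$ for $m = 1,\ldots,j-1$ and

$$\hat{u}_m + \frac{t_j^{(m)}}{z_{jj}}\hat{u}_j = \sum_{i=1}^{j-1}\Big(z_{im} + t_j^{(m)}\frac{z_{ij}}{z_{jj}}\Big)\tilde{u}_i + \sum_{\substack{i=j+1\\ i\ne l}}^{m} z_{im}\tilde{u}_i + t_l^{(m)}\tilde{u}, \qquad m = j+1,\ldots,k. \qquad (33)$$

This equation determines the matrices $\hat{U}'_{k-1}$ and $Z'_{k-1}$. In order to determine $x'_0$ and $d'_{k-1}$, note that $x_k$ can be computed as

$$x_k = x_0 + \sum_{\substack{i=1\\ i\ne j,l}}^{k} d_k^{(i)}\tilde{u}_i + \tilde{u}. \qquad (34)$$

Therefore $x'_0$ is just $x_0$, and $d'_{k-1}$ equals the vector $d_k$ without the j-th element and with the l-th element overwritten by 1. As before, we have deallocated two vectors from memory. The assembled vectors overwrite $\hat{u}_l$ and $\hat{c}_l$, so the locations of $\hat{u}_j$ and $\hat{c}_j$ can be used in the next step. Finally, we remark that these computations can be done with rank-one updates.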
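In the same way, the assembly update (32) to (34) can be sketched and checked with random data that satisfy (20). Indices are 0-based and `assemble_columns` is a hypothetical name. The new $\hat{c}$ is stored in column l and column j is removed, as described above.

```python
import numpy as np


def assemble_columns(x0, Uhat, Chat, Z, d, j, l):
    """Merge columns j < l (0-based) as in (32)-(34); column j is removed afterwards."""
    k = Z.shape[0]
    zjj = Z[j, j]
    C_new = Chat.copy()
    C_new[:, l] = d[j] * Chat[:, j] + d[l] * Chat[:, l]   # c-hat overwrites column l
    U_new = Uhat.copy()
    Z_new = Z.copy()
    for m in range(j + 1, k):
        t_j = Z[l, m] * d[j] / d[l] - Z[j, m]             # t_j^{(m)}
        t_l = Z[l, m] / d[l]                              # t_l^{(m)}
        U_new[:, m] += (t_j / zjj) * Uhat[:, j]           # left-hand side of (33)
        Z_new[:j, m] += t_j * Z[:j, j] / zjj              # coefficients of u~_i, i < j
        Z_new[l, m] = t_l                                 # coefficient of the assembled u~
    d_new = d.copy()
    d_new[l] = 1.0                                        # see (34); x'_0 = x_0
    keep = [i for i in range(k) if i != j]
    return x0.copy(), U_new[:, keep], C_new[:, keep], Z_new[np.ix_(keep, keep)], d_new[keep]


rng = np.random.default_rng(5)
n, k = 15, 5
A = n * np.eye(n) + rng.standard_normal((n, n))
Uhat = rng.standard_normal((n, k))
Z = 3.0 * np.eye(k) + np.triu(rng.standard_normal((k, k)), 1)   # upper triangular, nonsingular
Chat = A @ Uhat @ np.linalg.inv(Z)                              # enforce (20)
d = 1.0 + rng.random(k)
x0 = rng.standard_normal(n)
x_old = x0 + Uhat @ np.linalg.solve(Z, d)                        # (22)
pairs = [(0, 1), (1, 3), (0, 4), (2, 4)]
assembled = [assemble_columns(x0, Uhat, Chat, Z, d, j, l) for j, l in pairs]
```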


NUMERICAL EXPERIMENTS 


We will discuss the results of some numerical experiments, which concern the solution of two-dimensional
convection-diffusion problems on regular grids, discretized using a finite volume technique, resulting in a
pentadiagonal matrix. The system is preconditioned with ILU applied to the scaled system; see [3], [9]. The
first two problems are used to illustrate and compare the following solvers:

• (full) GMRES; 

• BICGSTAB; 

• GMRESR(m), where m indicates the number of inner GMRES iterations between the outer iterations;

• GCRO(m), which is GCR with m adapted GMRES iterations as inner method, using A_k;

• GMRESRSTAB, which is GMRESR with BICGSTAB as inner method;

• and GCROSTAB, which is GCR with the adapted BICGSTAB as inner method, using A_k.

Figure 3: Convergence history for problem 1

Figure 4: Convergence in time for problem 1

We will compare the convergence of these methods both with respect to the number of matrix-vector
products and with respect to CPU-time on one processor of the Convex 3840. This means, e.g., that each
step of BICGSTAB (and its variants) is counted as two matrix-vector products. We give both these
convergence rates because the main trade-off between (full) GMRES, the GCRO variants and the
GMRESR variants is fewer iterations against more dot products and vector updates per iteration. Any gain
in CPU-time then depends on the relative cost of the matrix-vector multiplication and preconditioning
versus the orthogonalization cost on the one hand, and on the difference in iterations on the other hand. We
will use our third problem to show the effects of truncation and to compare two strategies.

Problem 1. This problem comes from the discretization of

$$ -(u_{xx} + u_{yy}) + b\,u_x + c\,u_y = 0 $$

on [0, 1] × [0, 4], where

$$ b(x,y) = \begin{cases} 100 & \text{for } 0 < y < 1 \text{ and } 2 < y < 3, \\ -100 & \text{for } 1 < y < 2 \text{ and } 3 < y < 4, \end{cases} $$

and c = 100. The boundary conditions are u = 1 on y = 0, u = 0 on y = 4, u' = 0 on x = 0 and u' = 0 on
x = 1, where u' denotes the (outward) normal derivative. The stepsize in the x-direction is 1/100 and in the
y-direction is 1/50.

In this example we compare the performance of GMRES, GCRO(m) and GMRESR(m), for m = 5 and
m = 10. The convergence history of problem 1 is given in Fig. 3 and Fig. 4. Fig. 3 shows that GMRES
converges fastest (in matrix-vector products), which is of course to be expected, followed by GCRO(5),
GMRESR(5), GCRO(10) and GMRESR(10). From Fig. 3 we also see that GCRO(m) converges more smoothly
and faster than GMRESR(m). Note that GCRO(5) has practically the same convergence behavior as
GMRES. The vertical ‘steps’ of GMRESR(m) are caused by the optimization in the outer GCR iteration,
which does not involve a matrix-vector multiplication. We also observe that the GMRESR(m) variants
tend to lose their superlinear convergence behavior, at least during certain stages of the convergence history.
This seems to be caused by stagnation or slow convergence in the inner GMRES iteration, which (of
course) essentially behaves like a restarted GMRES. For GCRO(m), however, we see a much smoother and
faster convergence behavior, and the superlinearity of (full) GMRES is preserved. This is explained by the
‘global’ optimization over both the inner and the outer search vectors (the latter form a sample of the
entire, previously searched Krylov subspace). So we may view this as a semi-full GMRES. Fig. 4 gives the
convergence with respect to CPU-time. In this example GCRO(5) is the fastest, which is not surprising in
view of the fact that it converges almost as fast as GMRES, but at a much lower cost. Also, we see
that GCRO(10), while slower than GMRESR(5), is still faster than GMRESR(10). In this case the extra
orthogonalization costs in GCRO are outweighed by the improved convergence behavior.

Figure 5: Convergence history for problem 2

Figure 6: Convergence history for BICGSTAB variants for problem 2

Figure 7: Convergence in time for problem 2

Figure 8: Coefficients for problem 2

Problem 2. This problem is taken from [14]. The linear system comes from the discretization of

$$ -(a\,u_x)_x - (a\,u_y)_y + b\,u_x = f $$

on the unit square, with b = 2 exp(2(x^2 + y^2)). Along the boundaries we have Dirichlet conditions: u = 1 for
y = 0, x = 0 and x = 1, and u = 0 for y = 1. The functions a and f are defined as shown in Fig. 8; f = 0
everywhere, except for the small subsquare in the center where f = 100. The stepsize in the x-direction and in the
y-direction is 1/128.

In Fig. 5 a convergence plot is given for (full) GMRES, GCRO(m) and GMRESR(m). We used m = 10 and
m = 50 to illustrate the difference in convergence behavior in the inner GMRES iteration of GMRESR(m)
and GCRO(m). GMRESR(50) stagnates in the inner GMRES iteration, whereas GCRO(50) more or less
displays the same convergence behavior as GCRO(10) and full GMRES. Measured in matrix-vector
products, a small m seems to be the best choice for GMRESR(m).
In Fig. 6 a convergence plot is given for (full) GMRES, BICGSTAB, and the BICGSTAB variants,
GMRESRSTAB and GCROSTAB. In our experience the following strategy gave the best results for the
BICGSTAB variants:

• For GMRESRSTAB we ended an inner iteration after either 20 steps or a relative improvement of the
residual of 0.01;

• For GCROSTAB we ended an inner iteration after either 25 steps or a relative improvement of the
residual of 0.01.

The convergence of GMRESRSTAB in this example is somewhat typical for GMRESRSTAB in general
(albeit very bad in this case). This might be explained by the fact that the convergence of BICGSTAB
depends on a ‘shadow’ Krylov subspace, which it implicitly generates. Now, if one restarts, then
BICGSTAB also starts to build a new, possibly different, ‘shadow’ Krylov subspace. This may lead to
erratic convergence behavior in the first few steps. Therefore, if BICGSTAB does not converge (to the
relative precision) in the inner iteration, the ‘solution’ of the inner iteration may not be very good, and
therefore the outer iteration may not give much improvement either. At the start the same more or less
holds for GCROSTAB; however, after a few outer GCR iterations the ‘improved’ operator (A_k) somehow
yields better convergence than BICGSTAB by itself. This was also observed in more tests, although it may
also happen that GCROSTAB converges worse than BICGSTAB.

In Fig. 7 a convergence plot versus the CPU-time is given for GCROSTAB, BICGSTAB, GCRO(10) and
GMRESR(10). The fastest convergence in CPU-time is achieved by GCROSTAB(10), which is about 20%
faster than BICGSTAB notwithstanding the extra work in orthogonalizations. We also see that although
GCRO(10) takes fewer iterations than GMRESR(10), in CPU-time the latter is faster. So in this case the
decrease in iterations does not outweigh the extra work in orthogonalizations. For completeness we mention
that GMRESRSTAB took almost 15 seconds to converge, whereas GMRES took almost 20 seconds.

Problem 3. The third problem is taken from [10]. The linear system stems from the discretization of the
partial differential equation

$$ -u_{xx} - u_{yy} + 1000\,(x\,u_x + y\,u_y) + 10\,u = f $$

on the unit square with zero Dirichlet boundary conditions. The stepsize in both the x-direction and the
y-direction is 1/65. The right-hand side is selected once the matrix is constructed so that the solution is
known to be x = (1, 1, ..., 1)^T. The zero vector was used as an initial guess.
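The right-hand side of problem 3 is manufactured: once A is built, f = A·(1, ..., 1)^T, so the exact discrete solution is known in advance. The Python sketch below is a minimal illustration under two assumptions. It uses central differences for both the diffusion and the convection terms, since the paper does not say which discretization was used here, and it orders the 64 × 64 = 4096 interior unknowns lexicographically. The names build_problem3 and matvec are hypothetical.

```python
def build_problem3(n_intervals=65, beta=1000.0, sigma=10.0):
    """Rows of the 5-point matrix for -u_xx - u_yy + beta*(x u_x + y u_y) + sigma*u.

    Assumes central differences for all terms and lexicographic ordering of the
    (n_intervals - 1)^2 interior unknowns; zero Dirichlet values are dropped.
    Each row is a list of (column, value) pairs.
    """
    h = 1.0 / n_intervals
    m = n_intervals - 1

    def index(i, j):  # i, j in 1..m
        return (j - 1) * m + (i - 1)

    rows = []
    for j in range(1, m + 1):
        for i in range(1, m + 1):
            x, y = i * h, j * h
            row = [(index(i, j), 4.0 / h**2 + sigma)]
            # (di, dj, signed convection coefficient) for the E, W, N, S neighbours
            for di, dj, conv in ((1, 0, beta * x), (-1, 0, -beta * x),
                                 (0, 1, beta * y), (0, -1, -beta * y)):
                ii, jj = i + di, j + dj
                if 1 <= ii <= m and 1 <= jj <= m:
                    row.append((index(ii, jj), -1.0 / h**2 + conv / (2.0 * h)))
            rows.append(row)
    return rows


def matvec(rows, v):
    """Sparse row-wise matrix-vector product."""
    return [sum(val * v[col] for col, val in row) for row in rows]
```

With `A = build_problem3()` and `f = matvec(A, [1.0] * len(A))`, any solver applied to A x = f should return the vector of ones. Away from the boundary each row sums to the reaction coefficient 10, because the diffusion and central convection stencils sum to zero.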

In Fig. 9 we see a plot of the convergence history of full GMRES, GMRESR(5), GCRO(5) and
GCRO(10,5) for two different truncation strategies, where the first parameter gives the dimension of the
outer search space and the second the dimension of the inner search space. The number of vectors in the
outer GCR iteration is twice the dimension of the search space. For the truncated versions:

• ‘da’ means that we took ε = 10^{-3}, dropped the vectors u_1 and c_1 when δ(k) < ε, and assembled the
vectors u_9 and u_10 as well as the vectors c_9 and c_10 when δ(k) > ε;

• ‘tr’ means that we dropped the vectors u_9 and c_9 each step (ε = 0, see also [16]).
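The two truncation rules can be stated as a small decision function. This Python sketch is hypothetical. delta_k stands for the paper's δ(k), which is defined earlier and is simply taken here as a given number. Indices are 1-based as in the text. The text gives the assemble case as δ(k) > ε, and treating δ(k) = ε as "assemble" is an assumption of this sketch.

```python
def truncation_action(strategy, delta_k, eps=1e-3):
    """Return (action, indices) for one truncation step of GCRO(10,5).

    'da': drop u_1, c_1 if delta(k) < eps; otherwise assemble u_9/u_10 and c_9/c_10.
    'tr': drop u_9, c_9 every step.
    Indices are 1-based. delta_k is assumed to be computed elsewhere.
    """
    if strategy == "da":
        if delta_k < eps:
            return ("drop", (1,))
        return ("assemble", (9, 10))  # delta(k) == eps lands here (assumption)
    if strategy == "tr":
        return ("drop", (9,))
    raise ValueError(f"unknown strategy {strategy!r}")
```

In an outer loop, the "drop" action would be carried out with the update (31) and the "assemble" action with (32)-(34).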

Notice that GCRO(5) displays almost the same convergence behavior as full GMRES. GMRESR(5)
converges eventually, but only after a long period of stagnation. The truncated versions of GCRO(5) also
display stagnation, but for a much shorter period. After that the ‘da’ version seems to converge
superlinearly, whereas the ‘tr’ version still displays periods of stagnation, most notably at the end. This
indicates that the ‘da’ version is better able to retain most of the ‘convergence history’ than the ‘tr’
version. This kind of behavior was seen in more tests: ‘assembled’ truncation strategies seem to work
better than just discarding one or more iteration vectors.

In Table 1 we give the number of matrix-vector products, the number of memory vectors and the
CPU-time on a Sun workstation. From this table we see that GCRO(5) is by far the fastest method and
uses about half as many memory vectors as full GMRES and GMRESR(5). More interesting is that
GCRO(10,5) ‘da’ converges in the same time as GMRESR(5), but uses only one third of the memory space.

Figure 9: Convergence history for problem 3


CONCLUSIONS 


We have derived from the GMRESR inner-outer iteration schemes a modified set of schemes, which
preserve the optimality of the outer iteration. This optimality is lost in GMRESR since it essentially uses
‘restarted’ inner GMRES iterations, which do not take advantage of the outer ‘convergence history’.
Therefore, GMRESR may lose superlinear convergence behavior, due to stagnation or slow convergence of
the inner GMRES iterations.


Method             Mat-Vec   Memory Vectors   CPU-time
GMRES                   77               77       21.3
GMRESR(5)              188               81       18.5
GCRO(5)                 83               39        9.4
GCRO(10,5) ‘da’        150               25       18.3
GCRO(10,5) ‘tr’        244               25       30.3

Table 1: Number of matrix-vector products, number of memory
vectors and CPU-time in seconds for problem 3
In contrast, the GCRO variants exploit the ‘convergence history’ to generate a search space that has no
components in any of the outer directions in which we have already minimized the error. For GCRO(m)
this means we minimize the error over both the inner search space and a sample of the entire previously
searched Krylov subspace (the outer search space), resulting in a semi-full GMRES. This probably leads to
the smooth convergence (much like GMRES) and to the absence of the stagnation that may occur in the inner
GMRES iteration of GMRESR. Apparently the small subset of Krylov subspace vectors that is kept
approximates the entire generated Krylov subspace sufficiently well. For both GMRESR(m) and
GCRO(m), a small number of inner iterations seems to work well.

We may also say that the GCRO variants construct a new (improved) operator (of decreasing rank) after
each outer GCR iteration. Although breakdown of the inner method is possible for GCRO, it seems to
occur rarely, as indicated by Theorem 4 (it has never happened in any of our experiments).

With respect to the performance of the discussed methods, we see that GCRO(m) (almost) always converges in
fewer iterations than GMRESR(m). Because GCRO(m) is on average more expensive per iteration, this
does not always lead to faster convergence in CPU-time. This depends on the relative costs of the matrix-vector
product and preconditioner with respect to the cost of the orthogonalizations, and on the reduction in iterations
for GCRO(m) relative to GMRESR(m). Our experiments, with a cheap matrix-vector product and
preconditioner, show that even in this case the GCRO variants are very competitive with other solvers.
GCRO(m) is especially attractive when the matrix-vector product and preconditioner are expensive or when not enough
memory is available for (full) GMRES. GCRO with BICGSTAB also seems
to be a useful method, especially when a large number of iterations is necessary or when the available
memory space is small relative to the problem size. GMRESR with BICGSTAB does not seem to work as
well, probably because, in our observation, restarting BICGSTAB does not work well.

We have derived sophisticated truncation strategies and shown by example that superlinear convergence 
behavior can be maintained. From our experience, the ‘assembled’ version seems to have the most promise. 

Acknowledgements. The authors are grateful to Gerard Sleijpen and Henk van der Vorst for 
encouragement, helpful comments and inspiring discussions. 
IMPLEMENTING ABSTRACT MULTIGRID OR MULTILEVEL METHODS *


Craig C. Douglas 
Department of Computer Science 
Yale University 
New Haven, Connecticut 

SUMMARY 

Multigrid can be formulated as an algorithm for an abstract problem that is independent of the
partial differential equation, domain, and discretization method. In such an abstract setting, problems
not arising from partial differential equations can also be treated (cf. aggregation-disaggregation
methods). Quite general theory exists for linear problems, e.g., C. C. Douglas and J. Douglas, SIAM
J. Numer. Anal., 30 (1993), pp. 136-158.

The general theory was motivated by a series of abstract solvers (Madpack). The latest version (4)
was motivated instead by the theory. Madpack now allows for a wide variety of iterative and direct
solvers, preconditioners, and interpolation and projection schemes, including user callback ones. It
allows for sparse, dense, and stencil matrices. Mildly nonlinear problems can be handled. Also, there
is a fast multigrid Poisson solver (two and three dimensions).

The types of solvers and design decisions (including language, data structures, external library
support, and callbacks) are discussed here. Based on the author’s experiences with two versions of
Madpack, a better approach is proposed here. This is based on a mixed language formulation (C and
Fortran+preprocessor). Reasons for not just using Fortran, C, or C++ are given. Implementing the
proposed strategy is not difficult.


1. INTRODUCTION 


The term abstract multigrid was coined in [1]. This refers to theory which is quasi-independent of 
the elliptic boundary value problem. The dependence is introduced by assuming that the (discretized) 
problem satisfies a very small number of hypotheses which contribute simple expressions to the 
convergence rate formula. The theory in [1] is general enough to apply to nonnested solution spaces 
and includes example boundary value problems on general domains, with variable coefficients, and 
finite difference and finite element discretizations. 

The concept of abstract multigrid was pushed to the extreme in [2], where a general theory for 
linear problems is presented with virtually no constraints on the origin of the problems. 

Abstract multigrid is defined in §2. Two implementations of abstract multilevel methods (see [3]
and [4]) are discussed in §3. A discussion of what might be the right set of languages to implement
abstract multilevel methods is in §4. Finally, some conclusions are drawn in §5.

*This work was supported in part by IBM and the Office of Naval Research.


2. ABSTRACT MULTIGRID 

Assume we are solving some problem, possibly derived from a partial differential equation, possibly
not. Assume further that by various means a sequence of (linear) problems

$$ A_j x_j = b_j, \qquad 1 \le j \le k, \qquad (1) $$

is formed which approximates the real problem

$$ A_k x_k = b_k, \qquad (2) $$

where x_j, b_j ∈ M_j, 1 ≤ j ≤ k. Typically, M_j is a real or complex vector space when actually
computing the solution to the problem. Frequently,

$$ \dim(\mathcal{M}_j) \approx C\,\dim(\mathcal{M}_{j-1}), \qquad C > 1. $$
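If this ratio holds on every level, a standard geometric-series estimate (not stated in the original) bounds the total size of all levels by a fixed multiple of the finest one. This is one reason the coarser levels add little storage:

```latex
\sum_{j=1}^{k} \dim(\mathcal{M}_j)
  \approx \dim(\mathcal{M}_k) \sum_{i=0}^{k-1} C^{-i}
  < \frac{C}{C-1}\,\dim(\mathcal{M}_k).
```

For instance, halving the mesh spacing gives C = 4 in two dimensions (bound 4/3) and C = 8 in three dimensions (bound 8/7).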

There are typically three mappings between the neighboring solution spaces:

$$ \mathcal{R}_j,\ \mathcal{Q}_j : \mathcal{M}_j \to \mathcal{M}_{j-1}, \quad 2 \le j \le k, \qquad \mathcal{P}_j : \mathcal{M}_j \to \mathcal{M}_{j+1}, \quad 1 \le j \le k-1. $$

The R_j and Q_j are restriction (or projection) matrices and the P_j are prolongation (or interpolation)
matrices. Frequently, P_j = c R_{j+1}^T, where c ∈ ℝ. The matrices A_j and A_{j-1} are typically related
through the relation

$$ A_{j-1} = \mathcal{Q}_j A_j \mathcal{P}_{j-1}, \qquad 2 \le j \le k. $$

The Galerkin form of multigrid requires that Q_j = P_{j-1}^T. The Q_j are frequently injection matrices
when a finite difference discretization is applied to a partial differential equation.
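The Galerkin relation can be worked out on a small model case. This Python sketch assumes the unscaled 1D Dirichlet Laplacian tridiag(-1, 2, -1) on 7 fine points and linear interpolation from 3 coarse points, with coarse point J located at fine point 2J + 1. It uses exact fractions. Both the example operator and the helper names are illustrative choices, not taken from the paper.

```python
from fractions import Fraction


def poisson_1d(n):
    """Unscaled 1D Dirichlet Laplacian tridiag(-1, 2, -1) of order n."""
    return [[2 if i == j else -1 if abs(i - j) == 1 else 0 for j in range(n)]
            for i in range(n)]


def linear_interpolation(nc):
    """P: nc coarse values -> 2*nc + 1 fine values; coarse J sits at fine 2J + 1."""
    P = [[Fraction(0)] * nc for _ in range(2 * nc + 1)]
    for J in range(nc):
        P[2 * J][J] = Fraction(1, 2)
        P[2 * J + 1][J] = Fraction(1)
        P[2 * J + 2][J] = Fraction(1, 2)
    return P


def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def transpose(A):
    return [list(col) for col in zip(*A)]


def galerkin_coarse(A_fine, P):
    """A_{j-1} = Q_j A_j P_{j-1} with the Galerkin choice Q_j = P_{j-1}^T."""
    return matmul(matmul(transpose(P), A_fine), P)
```

Here `galerkin_coarse(poisson_1d(7), linear_interpolation(3))` is tridiag(-1/2, 1, -1/2). Full weighting R = ½ P^T corresponds to c = 2 in P_j = c R_{j+1}^T. With R and the 1/h² scaling, R A P becomes tridiag(-1, 2, -1)/(4h²), which is the usual operator on the coarse mesh of width 2h.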

A multilevel correction algorithm is simply defined by

Algorithm MGC ( lev, {A_j, x_j, b_j}_{j=1}^{k}, {P_j}_{j=1}^{k-1}, {R_j}_{j=2}^{k} )

1. x_lev ← Solver_lev(A_lev, x_lev, b_lev)

2. If lev > 1, then repeat 2a-2d until some condition is met:

2a. x_{lev-1} ← 0, b_{lev-1} ← R_lev (b_lev - A_lev x_lev)

2b. MGC ( lev - 1, {A_j, x_j, b_j}_{j=1}^{k}, {P_j}_{j=1}^{k-1}, {R_j}_{j=2}^{k} )

2c. x_lev ← x_lev + P_{lev-1} x_{lev-1}

2d. x_lev ← Solver_lev(A_lev, x_lev, b_lev)

A common condition in step 2 is to do steps 2a-2d some specified number of times (e.g., 0 for one-way
multigrid, 1 for a V-cycle, or 2 for a W-cycle).

On the coarsest level, lev = 1, the solver is frequently some flavor of Gaussian elimination (e.g., a
sparse one). Common solvers on the other levels include relaxation methods (e.g., point, line, plane,
or zebra Gauss-Seidel) and conjugate direction methods (e.g., conjugate gradients or residuals, CGS,
GMRES, or Orthomin). The latter class of iterative methods is most effective on highly nonuniform
meshes with a significant difference between the largest and smallest mesh spacing or diameter on a
level.
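Algorithm MGC can be made concrete on the simplest model problem. The Python sketch below assumes the following setup:

- the 1D Poisson equation -u'' = f on (0, 1) with zero Dirichlet data and n_j = 2^j - 1 interior points on level j;
- Gauss-Seidel as Solver on the finer levels and a direct tridiagonal solve (Gaussian elimination) on level 1;
- full-weighting restriction for R and linear interpolation for P.

None of these choices come from Madpack. They only illustrate the control flow of steps 1 and 2a-2d.

```python
def residual(x, b):
    """r = b - A x for A = tridiag(-1, 2, -1) / h^2, h = 1/(n+1), zero Dirichlet."""
    n = len(x)
    h2 = (1.0 / (n + 1)) ** 2
    r = []
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        r.append(b[i] - (2.0 * x[i] - left - right) / h2)
    return r


def gauss_seidel(x, b, sweeps=1):
    """In-place lexicographic Gauss-Seidel sweeps (the Solver on levels > 1)."""
    n = len(x)
    h2 = (1.0 / (n + 1)) ** 2
    for _ in range(sweeps):
        for i in range(n):
            left = x[i - 1] if i > 0 else 0.0
            right = x[i + 1] if i < n - 1 else 0.0
            x[i] = (b[i] * h2 + left + right) / 2.0


def tridiag_solve(b):
    """Direct (Thomas) solve of A x = b: Gaussian elimination on the coarsest level."""
    n = len(b)
    h2 = (1.0 / (n + 1)) ** 2
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = -0.5, b[0] * h2 / 2.0
    for i in range(1, n):
        denom = 2.0 + cp[i - 1]
        cp[i] = -1.0 / denom
        dp[i] = (b[i] * h2 + dp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def restrict(r):
    """Full weighting R: 2*nc + 1 fine values -> nc coarse values."""
    nc = (len(r) - 1) // 2
    return [0.25 * r[2 * J] + 0.5 * r[2 * J + 1] + 0.25 * r[2 * J + 2] for J in range(nc)]


def prolong(e):
    """Linear interpolation P: nc coarse values -> 2*nc + 1 fine values."""
    out = [0.0] * (2 * len(e) + 1)
    for J, v in enumerate(e):
        out[2 * J] += 0.5 * v
        out[2 * J + 1] += v
        out[2 * J + 2] += 0.5 * v
    return out


def mgc(lev, x, b, cycles=1, sweeps=1):
    """Algorithm MGC; x and b map level -> list, level 1 is the coarsest."""
    if lev == 1:                                   # step 1 on the coarsest level
        x[1] = tridiag_solve(b[1])
        return
    gauss_seidel(x[lev], b[lev], sweeps)           # step 1
    for _ in range(cycles):                        # step 2: 1 = V-cycle, 2 = W-cycle
        x[lev - 1] = [0.0] * len(x[lev - 1])       # 2a
        b[lev - 1] = restrict(residual(x[lev], b[lev]))
        mgc(lev - 1, x, b, cycles, sweeps)         # 2b
        x[lev] = [xi + ci for xi, ci in zip(x[lev], prolong(x[lev - 1]))]  # 2c
        gauss_seidel(x[lev], b[lev], sweeps)       # 2d
```

Calling `mgc(k, x, b)` repeatedly performs V-cycles. Passing `cycles=2` gives W-cycles, and `cycles=0` leaves only step 1, the one-way case.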

A general algorithm that provides very good initial guesses is the nested iteration one: 
Algorithm NIC ( lev, {A_j, x_j, b_j}_{j=1}^{k}, {P_j}_{j=1}^{k-1}, {R_j}_{j=2}^{k} )

1. MGC ( 1, {A_j, x_j, b_j}_{j=1}^{k}, {P_j}_{j=1}^{k-1}, {R_j}_{j=2}^{k} )

2. Do steps 2a-2b with lev = 2, ..., k:

2a. x_lev ← P_{lev-1} x_{lev-1}

2b. MGC ( lev, {A_j, x_j, b_j}_{j=1}^{k}, {P_j}_{j=1}^{k-1}, {R_j}_{j=2}^{k} )

A one-way multilevel algorithm means that Algorithm MGC never performs any portion of its step 2
as part of its use by Algorithm NIC. Most complexity arguments showing that multigrid is of optimal
order are based on Algorithm NIC, not Algorithm MGC.
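The control flow of Algorithm NIC does not depend on the problem, so it can be written against two callables. In this hypothetical sketch, prolong(lev, v) plays the role of P_{lev-1} and mgc(lev) runs Algorithm MGC in place on level lev. This interface is an assumption made for illustration only.

```python
def nic(k, x, prolong, mgc):
    """Algorithm NIC for levels 1..k (level 1 is the coarsest).

    x: dict level -> list. prolong(lev, v) maps a level lev-1 vector to level lev
    (P_{lev-1}); mgc(lev) performs Algorithm MGC on level lev, updating x in place.
    """
    mgc(1)                                   # step 1: solve on the coarsest level
    for lev in range(2, k + 1):              # step 2
        x[lev] = prolong(lev, x[lev - 1])    # 2a: initial guess from the coarser level
        mgc(lev)                             # 2b: improve it by multilevel correction
    return x
```

A one-way method corresponds to an mgc callable that performs only its step 1.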

For nonlinear problems, there are two standard approaches: the Full Approximation Scheme
(FAS) and damped Newton multilevel. FAS is similar to Algorithm MGC, but changes two lines:

2a. x_{lev-1} ← R^(FAS)_lev x_lev, b_{lev-1} ← R_lev (b_lev - A_lev x_lev) + A_{lev-1} x_{lev-1}

2c. x_lev ← x_lev + P_{lev-1} (x_{lev-1} - R^(FAS)_lev x_lev)

Note that in many situations R^(FAS)_lev = R_lev. Also, the operator A_j is no longer linear, but involves
function evaluations.
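The two FAS lines can be sketched with the operators passed in as callables. A_fine and A_coarse stand for the (possibly nonlinear) A_lev and A_{lev-1}. R, R_fas and P stand for R_lev, R^(FAS)_lev and P_{lev-1}. These names are illustrative and are not DAMG's interface. A useful check is that if x_lev already solves the fine problem, then R^(FAS)_lev x_lev solves the coarse problem, so the correction in 2c vanishes.

```python
def fas_step_2a(x_fine, b_fine, A_fine, A_coarse, R, R_fas):
    """FAS step 2a: coarse initial guess and coarse right-hand side."""
    x_coarse = R_fas(x_fine)
    r = [bi - ai for bi, ai in zip(b_fine, A_fine(x_fine))]
    b_coarse = [ri + ai for ri, ai in zip(R(r), A_coarse(x_coarse))]
    return x_coarse, b_coarse


def fas_step_2c(x_fine, x_coarse, P, R_fas):
    """FAS step 2c: interpolate only the change computed on the coarse level."""
    x_init = R_fas(x_fine)  # x_fine is unchanged between steps 2a and 2c
    corr = P([xc - xi for xc, xi in zip(x_coarse, x_init)])
    return [xf + c for xf, c in zip(x_fine, corr)]
```

For instance, with A(v) = v + v³ applied componentwise on both levels and a fine x that solves A(x) = b exactly, step 2a returns a coarse problem whose solution is the injected fine solution.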

The damped Newton algorithm is a modification of Algorithm NIC. Before each reference to 
Algorithm MGC, a Jacobian is formed and a damped Newton step is performed. The last Jacobian 
on a level is saved for use in subsequent multilevel correction steps. 

The difference between these two nonlinear approaches is easy to categorize. FAS uses a nonlinear 
iterative method (e.g., nonlinear Gauss-Seidel) while damped Newton uses standard linear solvers. 
When evaluating the nonlinear function is inexpensive, FAS usually produces an approximate solution 
faster than the damped Newton multilevel method. However, when the function evaluations are 
expensive, the damped Newton multilevel method usually produces an approximate solution faster 
than FAS. 

Note that in Algorithms MGC and NIC, there are only two obvious components per level: the 
solver and the methods for passing information between levels. There are other components hidden 
by this formulation: a possible set of preconditioners for use by the solvers, a method for computing 
a matrix-vector product for some set of storage formats, and a set of discretization methods in the 
partial differential equation case. 

For problems not arising from partial differential equations, the only components in Algorithm
MGC that can be optimized are the solvers and the restriction matrices Q_j and R_j. Both theory and
practical experience demonstrate rather conclusively that finding better Q_j matrices is far superior
to trying to find an optimal iterative method as the solver (e.g., see [5]).

For partial differential equation problems, using better discretization methods usually has a
bigger impact on the convergence rate than searching for a slightly better interpolation scheme or
iterative solver. There are exceptions to this for trivial problems (e.g., Laplace’s equation on a square
with uniform grids).


3. MADPACK 


The term Madpack is a mnemonic for multigrid (multilevel), aggregation-disaggregation package.

It started as a compact set of subroutines for solving problems of the form (1)-(2). The first two
versions were released in 1986 and the fourth in 1992. All versions have been written using numerous 
macros to hide data structures and improve readability. Currently, version 2 is available through
Netlib and MGNet (see [6] and [7] for a description of MGNet). Version 2 is in the public domain.
Version 4 is not really compatible with version 2 and is owned by IBM. It is available through
IBM’s Internet anonymous ftp server and MGNet. All announcements and bug fixes
are distributed through MGNet.

Version 2 is discussed in §3.1. Version 4 is discussed in §3.2. A number of issues that these two 
versions raise are discussed in §4. 

3.1. MADPACK, VERSION 2 


Version 2 [8] was originally written in an extended flavor of Ratfor. A translator converted this
to Fortran-77, which, in turn, is compiled by whatever compiler is available on a given machine. After
it was determined that on some machines (e.g., SUN workstations in 1986) C code ran up to 40% faster
than the Fortran-77 equivalent, the entire code was ported to C. Including comments, there are only
1500-1600 lines in each language version. All three language versions are distributed.

Version 2 consists of 9 subroutines:

Routine    Description
klmg       Algorithm MGC
klni       Algorithm NIC
klax       matrix-vector multiply
kldsnf     factor matrices
kldsss     forward/backward solves
klres      compute residual
klsgs      Symmetric Gauss-Seidel
klsgse     Preconditioned conjugate gradients
klsgsm     Preconditioned Orthomin(1)


The first two subroutines, klmg and klni, are meant to be the only user-callable subroutines, but any
of them can be called directly.

Version 2 supports an odd flavor of sparse matrix storage (see [9]) in the solver routines. The
matrices A_j are assumed to have a symmetric nonzero structure, independent of whether or not
A_j = A_j^T. This means that in some cases a small number of zeroes are actually stored in the sparse
matrix representation of A_j. The main diagonal, the nonzero elements of the columns of the upper
triangular part of A_j, and the nonzero elements of the rows of the lower triangular part of A_j are
stored independently (the lower part only if A_j is nonsymmetric). Because the nonzero structure is
symmetric, only half of the row or column indices need to be stored. It also allows for
numerous computational simplifications and some tricks in reducing costs in the direct and iterative
solvers.

For restriction and prolongation matrices, two additional storage formats are supported. A general
sparse matrix format, as implemented in the second Yale Sparse Matrix Package (see [11]), is used for
irregular grids. A stencil format is extremely efficient for uniform or tensor product grids. Typically,
r_j + c storage elements are used, where r_j = Rows(R_j) and c is a small natural number.
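The storage idea for A_j can be sketched as follows. This is a simplified, hypothetical layout, not the actual Madpack arrays:

- The diagonal is kept separately.
- Column j of the strict upper triangle and row j of the strict lower triangle share one index list. They have the same pattern because the nonzero structure is symmetric.
- The lower values are omitted when A_j is symmetric.
- An entry is stored when either A[i][j] or A[j][i] is nonzero, which is how explicit zeroes enter the representation.

```python
def compress(A, symmetric_values=False):
    """Store A (dense list of lists) assuming a symmetric nonzero structure.

    Returns a dict with the diagonal, column pointers `start`, shared row/column
    indices `idx`, the strict upper triangle by columns (`upper`) and the strict
    lower triangle by rows (`lower`, None when A is symmetric).
    """
    n = len(A)
    diag = [A[i][i] for i in range(n)]
    start, idx, upper, lower = [0], [], [], []
    for j in range(n):
        for i in range(j):
            # Structural nonzero if either mirror entry is nonzero; one of the
            # two stored values may therefore be an explicit zero.
            if A[i][j] != 0 or A[j][i] != 0:
                idx.append(i)
                upper.append(A[i][j])    # column j of the upper triangle
                lower.append(A[j][i])    # row j of the lower triangle
        start.append(len(idx))
    return {"diag": diag, "start": start, "idx": idx, "upper": upper,
            "lower": None if symmetric_values else lower}


def sym_struct_matvec(S, x):
    """y = A x using the compressed form produced by compress()."""
    n = len(S["diag"])
    y = [S["diag"][i] * x[i] for i in range(n)]
    lower = S["upper"] if S["lower"] is None else S["lower"]
    for j in range(n):
        for p in range(S["start"][j], S["start"][j + 1]):
            i = S["idx"][p]               # always i < j
            y[i] += S["upper"][p] * x[j]  # A[i][j] * x[j]
            y[j] += lower[p] * x[i]       # A[j][i] * x[i]
    return y
```

One index array serves both triangles, which is the saving described in the text. The symmetric case additionally avoids storing the lower values.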
Table 1: Solvers and preconditioners

                                    Preconditioners
Solver                     None   User   ILU   Diag   SGS   SSOR
NoSolver                   *      *      *     *      *     *
User                       any    any    *     *      *     *
Factor                     GD     *      *     *      *     *
Solve                      GD     *      *     *      *     *
Symmetric Gauss-Seidel     G      *      *     *      *     *
Gauss-Seidel               GSD    *      *     *      *     *
Gauss-Seidel, red-black    GSD    *      *     *      *     *
Conjugate gradients        GSD    GSD    G     G      G     G
Minimum residuals          GSD    GSD    *     *      G     *
CGS                        G      *      G     G      *     G
CGSTAB                     G      *      G     G      *     G
GMRES                      G      *      G     G      *     G

* = Error
G = General sparse matrices
S = Stencil matrices
D = Dense matrices
any = any format


Only Algorithms MGC and NIC are included. There is no support for nonlinear or time-dependent
problems. Version 2 has been embedded in other people’s nonlinear and time-dependent codes,
however. There is also no user callback mechanism, so if the user has some special solver,
preconditioner, or change of level subroutine, the source code for version 2 has to be modified.

3.2. MADPACK, VERSION 4 

This is a complete redesign and rewrite of Madpack. It is incompatible with version 2 in numerous
ways. It actually consists of two quite distinct codes, DAMG [3] and DPMG [4]. DAMG is an abstract
solver for linear and mildly nonlinear problems (FAS is supported). DPMG is a fast Poisson solver
for two- and three-dimensional problems on simple uniform or tensor product grids.

DAMG supports dense, stencil, and general sparse matrix formats (this time, the more common
format of the first Yale Sparse Matrix Package [12] was used) in the computational kernels. The dense case
rarely occurs in solving partial differential equations; it is more common when solving aggregation-disaggregation
problems (see [5]). Table 1 contains a summary of the solvers and preconditioners
supported in the IBM version.

Unlike version 2, version 4 requires an external library of solvers (some solvers are provided,
but not many). What is distributed by IBM runs only on machines with IBM's proprietary engineering
and scientific subroutine library. Currently, this library only runs on IBM mainframes and RISC
System/6000 workstations. Since DAMG was originally written on a machine that is not supported
by this library, there is obviously a version which uses other libraries, e.g., LAPACK and the first
Yale Sparse Matrix Package. Interfacing DAMG to other libraries is now fairly painless.

Table 2: Level independent information data structure

iparm(i)
 i   Symbolic name   Definition
 1   mgfn            Which multilevel algorithm
 2   l2infm          Second dimension of infm array
 3   bxsize          Length of b and x arrays
 4   lndm            Length of dm array
 5   lnim            Length of im array
 6   lnjm            Length of jm array
 7   levelf          Index of the finest level
 8   levelc          Index of the coarsest level
 9   startl          Index of the starting level
10   presva          Preserve coarsest level’s matrices or not
11   lastdm          Index of last element in dm in use
12   lastim          Index of last element in im in use
13   lastjm          Index of last element in jm in use
14   info            Control of debugging information
15   restart         Continued computation indicator
20   assist          When all else fails

DAMG accepts three external subroutine arguments in case users want to use their own
solver(s), preconditioner(s), or change of level subroutine(s). In retrospect, there should have been
a fourth for matrix-vector multiplies. These features are used extensively in DPMG to avoid storing
matrices.

Both DAMG and DPMG are written in the same extended Ratfor as version 2. Only the
Fortran-77 translation is distributed by IBM, however. The codes assume double precision real data.
Changing to single precision only requires changing one line of a file included by each of the Ratfor
codes. Changing to complex data is only slightly harder.

DAMG can be restarted after it returns. This allows coarse levels to be removed from the
computational flow. It also allows an external adaptive grid refinement procedure to work with
DAMG to add finer levels.

Data is passed to and from DAMG in the standard awkward style imposed by Fortran-77's
limitations. Matrices and vectors are piled into a set of five (integer and real) vectors. As a substitute
for the more natural pointer data type, indices are stored in information data arrays, indexed by the
level number (see Tables 2-4). A language that supports more reasonable data structures, pointers,
and dynamic memory allocation and freeing would simplify this.
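The pointer substitute can be illustrated in a few lines. In this hypothetical Python sketch, one flat real work array holds every level's x_j (or b_j), and a per-level table records the 1-based starting index and the length. These two fields mirror the IdxXB and NXB entries of Table 3. The function names are assumptions, and the Python slices only imitate what Fortran-77 does by passing work(idx) to a subroutine.

```python
def pack_levels(sizes):
    """sizes: dict level -> number of unknowns.

    Returns (work, table) where work is one flat real array and
    table[level] = (IdxXB, NXB) with 1-based IdxXB, as in Fortran-77.
    """
    table, next_free = {}, 1
    for lev in sorted(sizes):
        table[lev] = (next_free, sizes[lev])
        next_free += sizes[lev]
    return [0.0] * (next_free - 1), table


def level_view(work, table, lev):
    """Copy of the slice belonging to one level (Fortran would pass work(idx))."""
    idx, n = table[lev]
    return work[idx - 1: idx - 1 + n]


def store_level(work, table, lev, values):
    """Write one level's values back into the flat work array."""
    idx, n = table[lev]
    if len(values) != n:
        raise ValueError("length does not match NXB for this level")
    work[idx - 1: idx - 1 + n] = values
```

Adding or removing a level changes only the table, which is what makes DAMG restarts with new levels feasible without real pointers.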

Table 2 contains information which is level independent. This includes the length and the index 
of the last used element of certain vectors, which multilevel algorithm to start with, the indices of 
the finest, coarsest, and starting levels, how much debugging information to print, and whether this 
is a restart of an earlier computation. 

Table 3 contains information relevant to the computational algorithms which is level dependent. 
Table 3: Level dependent algorithm information data structure

infalg(i, j) on level j
 i   Symbolic name   Definition
 1   Solver          Which solution method
 2   Solver Iters    Iterations of Solver
 3   Precond         Which preconditioning method
 4   MG Iters        Iterations of Algorithm MGC or MGFAS
 5   NI Iters        Iterations of Algorithm NIC or NIFAS
 6   IdxXB           Index of first element of b_j or x_j in b or x
 7   NXB             Number of elements in b_j and x_j
 8   Colors          Number of colors in a multicolor ordering


Table 4: Matrix information data structure

infm(i, k, j) on level j
i/k   1        2        3        4          5
1     AType    RType    PType    NIPType    FASRType
2     ACols    RCols    PCols    NIPCols    FASRCols
3     ARows    RRows    PRows    NIPRows    FASRRows
4     ADim1    RDim1    PDim1    NIPDim1    FASRDim1
5     ADim2    RDim2    PDim2    NIPDim2    FASRDim2
6     IdxA     IdxR     IdxP     IdxNIP     IdxFASR
7     IdxIA    IdxIR    IdxIP    IdxINIP    IdxIFASR
8     IdxJA    IdxJR    IdxJP    IdxJNIP    IdxJFASR


Table 5: How matrices are chosen for changing levels

Wanted          Order of selection
R_j             R_j, P_{j+1}, and NIP^T_{j+1}
P_j             P_j, R_{j+1}, and NIP_j
NIP_j           NIP_j, P_j, and P_{j+1}
R_j^(FAS)       R_j^(FAS), R_j, P_{j+1}, and NIP_{j+1}

This includes the solver and preconditioner pairing, how many iterations of the algorithms to use on
this level, the index into the solution and right-hand side vectors for x_j and b_j, and their lengths.

When changing levels, it is very rare that R_j, P_j, NIP_j, and R_j^(FAS) will all be defined. NIP_j
corresponds to a special version of P_j in step 2a in Algorithm NIC (see §2). Usually only one or two
of these will be defined. Further, the matrices are typically related to each other in very particular
ways mathematically. An effort has been made to allow users of DAMG the option of generating
only one matrix when it can be re-used or is the transpose of another matrix. DAMG determines
which operation is wanted and then determines from information in the (three-dimensional) infm
data structure (see Table 4) how to change levels. Table 5 contains the order of choice, as determined
by which matrix is wanted. The user callback for changing levels is the last choice unless the matrix
type specifies doing this.

DPMG uses DAMG to do multileveling. Specialized solvers, interpolation, and projection
subroutines are used throughout the computations, however. This means that DPMG does not store
matrices normally, thus saving enormous amounts of memory which can be used instead for solving
much larger problems. DPMG solves

$$ \begin{cases} -\Delta u = b & \text{in } \Omega, \\ u = g_0 & \text{on } \partial\Omega_0, \\ u_n = g_1 & \text{on } \partial\Omega_1, \end{cases} $$

where ∂Ω_0 ∪ ∂Ω_1 = ∂Ω and ∂Ω_0 ∩ ∂Ω_1 = ∅. This is discretized on grids Ω_j in Ω ∪ ∂Ω_0 ∪ ∂Ω_1.

In essence, linear systems of the form (1)-(2) are solved approximately for a sequence of grids Ω_j.
The vectors x_j and b_j can be thought of as “grid functions” on Ω_j. The values of b in Ω and of g_1 on ∂Ω_1
are stored in b_j (multiplied by the square of the mesh spacing when a uniform mesh is used). The
values of g_0 on ∂Ω_0 and an initial guess to the solution in Ω ∪ ∂Ω_1 are stored in x_j before the call
to DPMG. DPMG uses a central difference discretization of Poisson’s equation, even at Neumann
boundary points. Dirichlet boundary points are not eliminated a priori.

Interpolation is either bilinear, trilinear, or a fourth-order method based on (3). The latter uses
the difference operator, similar to a Gauss-Seidel iteration with a three-color ordering and a rotated
operator, to improve the order of the interpolation (see [13]).

The three restriction methods are based on stencils, which are described in detail in [14]. The two
second-order methods are based on [1,2,1] and [1,4,1] weightings in one dimension. Tensor products
are used to generate the stencils in higher dimensions. The fourth-order stencil is an average of the
[1,4,1] tensor-product stencil and point injection.
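The tensor-product construction and the fourth-order average can be sketched in C for two dimensions. This is illustrative only. The function names are hypothetical, and normalizing each stencil to unit sum is an assumption of the sketch. DPMG's actual scaling, for example relative to the $h^2$-scaled right-hand side, may differ.

```c
#include <assert.h>

/* Normalized 3x3 tensor-product stencil built from 1-D weights w[0..2]. */
static void tensor_stencil(const double w[3], double s[3][3]) {
    double sum = w[0] + w[1] + w[2];
    assert(sum > 0.0);
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            s[i][j] = (w[i] / sum) * (w[j] / sum);
}

/* Average of the [1,4,1] tensor-product stencil and point injection
   (weight 1 at the center, 0 elsewhere). */
static void fourth_order_stencil(double s[3][3]) {
    const double w141[3] = {1.0, 4.0, 1.0};
    tensor_stencil(w141, s);
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            s[i][j] = 0.5 * (s[i][j] + ((i == 1 && j == 1) ? 1.0 : 0.0));
}

/* Restrict to the coarse point that coincides with fine point (i, j)
   of an n-by-n row-major array; (i, j) must be an interior point. */
static double restrict_point(const double *fine, int n, int i, int j,
                             double s[3][3]) {
    assert(i > 0 && i < n - 1 && j > 0 && j < n - 1);
    double v = 0.0;
    for (int di = -1; di <= 1; di++)
        for (int dj = -1; dj <= 1; dj++)
            v += s[di + 1][dj + 1] * fine[(i + di) * n + (j + dj)];
    return v;
}
```

With the [1,2,1] weights the center weight is 1/4. For the fourth-order stencil it is (4/9 + 1)/2 = 13/18, and the weights still sum to one.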

Only Algorithms MGC and NIC are options. The solvers are sparse Gaussian elimination and
Gauss-Seidel with either the natural or the red-black ordering.

DPMG was designed to run very fast on four quite different architectures: 

1. IBM mainframes with vector units. 

2. Conventional vector machines. 

3. Nonvector machines with multiply-add hardware chaining. 

4. Nonvector machines with no fancy hardware. 
An example of 2 above is a Cray, an example of 3 is an IBM RISC System/6000 workstation, and an
example of 4 is a SUN workstation or a PC.

The natural-ordering Gauss-Seidel subroutines were rewritten in IBM mainframe vector
assembler. These routines are always faster than their Fortran equivalents, no matter what vector
lengths are used. As an interesting aside, a version was produced that vectorizes completely by using an
odd reinterpretation of how to compute the updates based on the trailing vector elements that
normally do not vectorize. This is described in [15]. Unfortunately, the trick does not work in Fortran,
C, or C++.

The usual philosophy for vectorizing Gauss-Seidel is to use a red-black ordering. In addition, this
allows the interpolation subroutines to ignore half of the fine grid points. However, the red-black
ordering has an unfavorable feature: the right-hand side and approximate solution vectors pass
through cache twice per iteration. This can be alleviated only if the solver is blocked according to the
cache size. Because of the boundary conditions in (3) and the fact that DPMG does not store the
matrices, such blocking is overly complicated to program. Hence, DPMG uses a traditional
implementation for the red-black subroutines.

While the multilevel convergence properties of red-black Gauss-Seidel are better than those of the
naturally ordered version, both solvers provide about equal performance when used with Algorithm NIC and a V-cycle.

4. LANGUAGE ISSUES 

In this section, the advantages and disadvantages of Fortran, C, and C++ are discussed in the
context of an abstract multilevel solver, and a mixed solution is proposed.

4.1. FORTRAN 


In §3.2, the disadvantages of Fortran-77 in terms of data structures were discussed. There is no
conceivable way around them. Even macros or Ratfor help only so much. The real
problem is that users of the package still have to initialize the data structures, and they are not likely
to use either my macros or Ratfor.

DAMG uses scratch storage in its solvers. Predicting the amount needed for each (solver,
preconditioner) pair is an art that no user should ever have to master. Worse, the formulas given for some
popular sparse matrix iterative solvers are wrong (they predict less memory than is required). For all
of the solvers used in §3, the amount of scratch storage can be written in terms of $N$ (the number of
rows or columns), $NZ$ (the number of nonzeroes in $A_j$), and a constant:

$$N_{\mathrm{scr}} = C_N \cdot N + C_{NZ} \cdot NZ + C_{\mathrm{extra}}. \qquad (4)$$

While default values can be used, the user should be able to override them.
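Formula (4), together with a user override, fits in a few lines of C. In this sketch an all-zero coefficient set means "use the defaults", following the convention suggested in §4.4. The struct and function names, and the sample coefficients in any usage, are hypothetical. They are not the requirements of any actual solver.

```c
#include <assert.h>
#include <stddef.h>

/* Coefficients of formula (4): N_scr = C_N * N + C_NZ * NZ + C_extra. */
struct ScratchCoeffs {
    double CN, CNZ, Cextra;
};

/* Scratch length for a (solver, preconditioner) pair. A NULL or all-zero
   user record means "use the defaults". The result is rounded up. */
static long scratch_length(const struct ScratchCoeffs *user,
                           const struct ScratchCoeffs *defaults,
                           long n, long nz) {
    const struct ScratchCoeffs *c = user;
    assert(defaults != NULL && n >= 0 && nz >= 0);
    if (c == NULL || (c->CN == 0.0 && c->CNZ == 0.0 && c->Cextra == 0.0))
        c = defaults;
    double len = c->CN * (double)n + c->CNZ * (double)nz + c->Cextra;
    long words = (long)len;
    if ((double)words < len)
        words++;                 /* round a fractional length up */
    return words;
}
```

For example, with hypothetical defaults of five vectors of length $N$ ($C_N = 5$), $N = 100$ gives 500 words, whatever $NZ$ is.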

However, there are some areas where Fortran shines. For one, real and complex data types of
various word lengths are part of the language. So, by using a simple preprocessor (e.g., /lib/cpp or
m4) that is available on most computer systems used for scientific computation, one
source code can be maintained, even if multiple subroutine names are generated, one per supported
data type. For example, in the Ratfor source code for DAMG, subroutine mgal is referenced by

NameIt(mgal)
Table 6: New Matrix Structure

    struct Matrix {
        int   MatrixType;     /* the matrix type */
        int   MatrixCols;     /* number of columns */
        int   MatrixRows;     /* number of rows */
        int   MatrixLDim;     /* leading dimension for dense matrices */
        void *MatrixCoeffs;   /* Pointer to matrix elements */
        int  *MatrixIA;       /* Pointer to IA elements */
        int  *MatrixJA;       /* Pointer to JA elements */
    };


NameIt prepends the letter d (double real), s (single real), z (double complex), or c (single complex),
depending on the definition of a macro, FLOAT.
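The NameIt mechanism can be imitated with the C preprocessor alone. This hypothetical sketch is not DAMG's Ratfor macro. It assumes FLOAT takes the codes 1 to 4 of the dtype table in §4.4, which the text relates to FLOAT, and it uses token pasting to generate one routine name per data type.

```c
#include <assert.h>

/* Hypothetical analogue of NameIt; FLOAT codes assumed to match the
   dtype table (1 single real, 2 double real, 3 single complex,
   4 double complex). */
#ifndef FLOAT
#define FLOAT 2
#endif

#if FLOAT == 1
#define NAME_PREFIX s
#elif FLOAT == 2
#define NAME_PREFIX d
#elif FLOAT == 3
#define NAME_PREFIX c
#elif FLOAT == 4
#define NAME_PREFIX z
#else
#error "unsupported FLOAT value"
#endif

/* Two levels so that NAME_PREFIX is expanded before ## pastes tokens. */
#define NAME_PASTE_(a, b) a##b
#define NAME_PASTE(a, b) NAME_PASTE_(a, b)
#define NameIt(name) NAME_PASTE(NAME_PREFIX, name)

/* One source definition; with FLOAT == 2 this defines dmgal(). The body
   is a placeholder, not DAMG's mgal. */
int NameIt(mgal)(int levels) {
    assert(levels >= 0);
    return levels + 1;
}
```

Compiling the same file with -DFLOAT=1 would define smgal instead. One source file therefore yields one entry point per data type.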

Another area where Fortran does well is in optimizing code for certain classes of machines,
particularly those with vector units. The author naively assumed that vector machines would go the way of
the dinosaurs with the advent of very fast superscalar workstations. Unfortunately (or fortunately,
depending on your view), several manufacturers are gluing vector units onto superscalar workstations.
While some C compilers have made serious inroads in producing very high-quality code, Fortran still
holds some advantages in this case.


4.2. C 


This language has an obvious disadvantage: complex and double complex are not part of
the language. While either can be defined as a structure, computing with them is inexcusably
awkward. In particular, maintaining a single set of solvers for real and complex data means writing
a set of weird macros to do floating point arithmetic. This is unacceptable.

However, not all of DAMG's or DPMG's subroutines are solvers. In fact, the subroutines that
implement the multilevel algorithms or choose which solver to call are really doing bookkeeping, not
floating point arithmetic. For these subroutines, C provides all of the features necessary to
dramatically simplify both the subroutines themselves and the entire calling sequence. Just being
able to dynamically allocate and free memory would reduce the user's frustration at trying to guess
how much scratch storage to pass to DAMG.

C can easily save addresses of objects, e.g., of subroutines or data objects, in complicated data 
structures. Hence, routines can be called incrementally to pass very complex data objects to an 
implementation of an abstract multilevel algorithm without any one call being very complicated. 
This reduces the aggravation of using a complex program considerably. 


4.3. C++ 


Many of the positive comments about C apply directly to C++. Classes can be constructed
instead of structures. Further, C++ usually comes with a complex class (though not necessarily in both
single and double precision), alleviating C's worst drawback.

One of C++'s strongest design features is the ability to design classes abstractly. At run time, the
correct version of a virtual routine is accessed. This feature, while useful, is overkill in the context
of abstract multigrid solvers. The C data type void*, a pointer to any data type, is sufficient to
overcome many of the reasons why C++ would be useful in this context (see §4.4).

Table 7: External subroutine information structure

    struct ExternSubr {
        int     (*Subr)();   /* Pointer to integer function */
        int      *IParms;    /* Pointer to integer parameters */
        void     *FParms;    /* Pointer to floating point parameters */
        float     CN;        /* See (4) */
        float     CNZ;       /* See (4) */
        float     Cextra;    /* See (4) */
        int       SaveScr;   /* Save scratch areas between calls? */
        void    **Scrs;      /* Vector of pointers to scratch areas */
        int      *NScrs;     /* Vector of lengths of scratch areas */
    };

A drawback to using C++ is that there is frequently a lot of overhead hidden from the user.
This makes C++ programs run unnecessarily slower than the equivalent C or Fortran programs.
Interfacing C++ programs to Fortran programs is sometimes challenging, too.

A more serious drawback is that C++ has not yet been standardized. It is evolving, with major
new versions coming out yearly. This would not be so bad except that features are sometimes dropped
or changed in incompatible ways in newer versions of the language. For someone who wants to write
a code once and never have to touch it again, this is not a point in C++'s favor.

4.4. C AND FORTRAN: MIXED LANGUAGE PROGRAMMING 


My personal belief is that mixing Fortran+preprocessor and C is the best choice now: implement
Algorithms MGC and NIC in C and implement the computational solvers in Fortran+preprocessor.
Many people who compute know only one language well and are normally not comfortable with
a mixed-language set of programs. An interface described at the end of this section lets these
people use what is proposed.

Suppose that we make no assumption about the language of a solver or preconditioner subroutine,
other than that it can be called from C. Then we do not know whether it can dynamically allocate memory.
Hence, some mechanism must be defined for passing it a block of memory. One way is to define a
structure for externally called subroutines, e.g., Table 7. The subroutine is expected to return some
indication of whether it worked or produced an error. IParms and FParms are integer
and floating point vectors containing the information that the specific subroutine actually needs. Setting
CN=CNZ=Cextra=0 could signify "use the defaults." Note that only one ExternSubr structure has
to be created per subroutine. In this definition, Subr is a pointer (or external reference) to an
integer-valued function with a fixed set of arguments. By providing an include file with an abstract solver,
a set of default ExternSubr structures can be given to the user (see Table 1).
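The ExternSubr idea of calling a solver only through a stored function pointer and its parameter vectors can be sketched in C. The sketch uses a shortened, hypothetical argument list in place of the real fixed set of arguments, a pared-down structure, and a toy diagonal solver. None of these names come from DAMG.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed argument list shared by every external solver. */
typedef int (*SolverFn)(int n, const double *b, double *x, int iters,
                        const int *iparms, void *fparms);

/* Pared-down version of Table 7 (scratch-area fields omitted). */
struct ExternSubrSketch {
    SolverFn Subr;               /* pointer to integer-valued function */
    int     *IParms;             /* integer parameters */
    void    *FParms;             /* floating point parameters */
    float    CN, CNZ, Cextra;    /* coefficients of (4); all 0 = defaults */
};

/* Example solver for a diagonal matrix whose entries are passed in FParms.
   Returns 0 on success and a nonzero error code otherwise. */
static int diag_solver(int n, const double *b, double *x, int iters,
                       const int *iparms, void *fparms) {
    const double *d = (const double *)fparms;
    (void)iters;
    (void)iparms;
    if (d == NULL || n < 0)
        return 1;
    for (int i = 0; i < n; i++) {
        if (d[i] == 0.0)
            return 2;            /* singular diagonal entry */
        x[i] = b[i] / d[i];
    }
    return 0;
}

/* The multilevel driver sees only the structure, never the solver's name. */
static int call_solver(const struct ExternSubrSketch *s, int n,
                       const double *b, double *x, int iters) {
    assert(s != NULL && s->Subr != NULL);
    return s->Subr(n, b, x, iters, s->IParms, s->FParms);
}
```

Swapping in another solver means filling in a different structure. The driver code does not change.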

Consider Table 4. A single structure can be defined that holds everything in a column of Table 4,
so that information about matrices is easier to define. Also, pointers to the actual
floating point and integer vectors or matrices can be stored (instead of indices into a messy vector),
placing all of the relevant information in one place (see Table 6).

Information that is in both Tables 3 and 4 can be rearranged into a single data structure, as in
Table 8. A NULL pointer can be used to indicate that a matrix does not exist.

An implementation of Algorithm MGC can then use the information in the LevInfo and ExternSubr
structures to first allocate scratch space (if necessary) and then call the solver. Assume lp is a pointer
to level j's LevInfo structure, lap is a pointer to lp->Aj's Matrix structure, lps is a pointer to
lp->solver's ExternSubr structure, and lpp is a pointer to either lp->precond's ExternSubr structure
or an empty one. Then the solver is called as follows:

    iret = lps->Subr( dtype, lpp->Subr, lp->SolverIters, lp->SolverRNorm,
                      lp->matrix_vec, lap->MatrixType, lap->MatrixRows,
                      lap->MatrixCols, lap->MatrixCoeffs, lap->MatrixIA,
                      lap->MatrixJA, lp->Xj, lp->Bj, lps->IParms,
                      lps->FParms, resid, scrs, nscrs, scrp, nscrp, oldscr );

Here scrs and scrp are pointers to scratch storage (with lengths nscrs and nscrp) for use by the solver
and preconditioner subroutines. Whether or not this is the same set of scratch areas as in a previous
call is indicated by oldscr. The resid argument gives the solver a place to return the residual,
which is used in calculating the next correction problem on a coarser level.

Numerous iterative procedures, based primarily on conjugate direction methods, require a user
callback routine to calculate matrix-vector products, thus requiring a matrix_vec argument to be
passed. Also, many iterative procedures allow a stopping criterion based on reducing the (possibly
scaled) residual norm by some amount, e.g., lp->SolverRNorm.
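A matrix_vec callback and a residual-norm stopping test can be sketched in C. The sketch makes several assumptions:

- MatrixIA and MatrixJA hold a compressed sparse row pattern.
- The indices are 0-based. Fortran codes commonly use 1-based indices instead.
- Squared norms are compared, so the code does not depend on the math library.

All function names are hypothetical.

```c
#include <assert.h>

/* y = A*x for A stored in compressed sparse row form: row i occupies
   entries ia[i] .. ia[i+1]-1 of a[] and ja[] (0-based indices). */
static void csr_matvec(int rows, const int *ia, const int *ja,
                       const double *a, const double *x, double *y) {
    assert(rows >= 0);
    for (int i = 0; i < rows; i++) {
        double sum = 0.0;
        for (int k = ia[i]; k < ia[i + 1]; k++)
            sum += a[k] * x[ja[k]];
        y[i] = sum;
    }
}

/* r = b - A*x; returns the squared 2-norm of r. */
static double residual_norm2(int rows, const int *ia, const int *ja,
                             const double *a, const double *x,
                             const double *b, double *r) {
    double s = 0.0;
    csr_matvec(rows, ia, ja, a, x, r);
    for (int i = 0; i < rows; i++) {
        r[i] = b[i] - r[i];
        s += r[i] * r[i];
    }
    return s;
}

/* Stop when ||r_new|| <= rnorm * ||r_0||, compared via squared norms. */
static int reduced_enough(double rnew2, double r02, double rnorm) {
    return rnew2 <= rnorm * rnorm * r02;
}
```

Here the reduction factor rnorm plays the role of lp->SolverRNorm.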

There is an important issue that must be addressed. Many people who compute know only Fortran,
not C. Using the data structures advocated in §4.2 would preclude these
people from using the abstract solvers. Some simple subroutines, callable from Fortran (or any
language), that build the data structures in a portable manner must be included. For example, a
Fortran program can call a C routine that returns a data handle (a small integer):

      mgh = mgini( levels, dtype )

This subroutine allocates space for the structures. The integer argument dtype determines
the data type (cf. the value of FLOAT in §4.1):


Dtype    Data        Floating point data description
1        float       single precision real
2        double      double precision real
3        complex     single precision complex
4        dcomplex    double precision complex
<0       user        -value = length in bytes


While this may seem ugly, this simple mechanism allows the C codes to be written in a "typeless"
manner. Note that a mechanism is in place for user-defined data types as well.
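The typeless handling can be pictured as a lookup from dtype to element size, followed by byte-level moves. In this C sketch the complex types are modeled as structures, as §4.2 describes. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layouts: complex types modeled as structures (see §4.2). */
typedef struct { float re, im; } mg_complex;
typedef struct { double re, im; } mg_dcomplex;

/* Bytes per element for a dtype code; 0 marks an invalid code. */
static size_t dtype_size(int dtype) {
    switch (dtype) {
    case 1: return sizeof(float);
    case 2: return sizeof(double);
    case 3: return sizeof(mg_complex);
    case 4: return sizeof(mg_dcomplex);
    default:
        /* user-defined type: -dtype is the length in bytes */
        return dtype < 0 ? (size_t)(-(long)dtype) : 0;
    }
}

/* Typeless copy of element i of src to element j of dst: the C layer
   moves bytes and never does arithmetic on them. */
static void copy_elem(void *dst, size_t j, const void *src, size_t i,
                      int dtype) {
    size_t sz = dtype_size(dtype);
    assert(sz > 0);
    const unsigned char *s = (const unsigned char *)src + i * sz;
    unsigned char *d = (unsigned char *)dst + j * sz;
    for (size_t k = 0; k < sz; k++)
        d[k] = s[k];
}
```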

Matrix structures are defined similarly and return a matrix handle. 

mat = mgmat ( mgh, type, cols, rows, ldim, coeffs, ia, ja ) 

Matrix handles are coupled to the data handle. 
Table 8: Level Information Structure

    struct LevInfo {
        struct ExternSubr *solver;       /* Pointer to how to call solver */
        struct ExternSubr *precond;      /* Pointer to how to call preconditioner */
        struct ExternSubr *matrix_vec;   /* Pointer to how to call matrix*vector */
        struct ExternSubr *change_lev;   /* Pointer to how to call level changer */
        int    SolverIters;              /* Number of iterations in solver() */
        float  SolverRNorm;              /* How much to reduce residual norm */
        int    MGIters;                  /* Number of iterations of MGC */
        int    NIIters;                  /* Number of iterations of NIC */
        void  *Xj;                       /* Pointer to x_j */
        void  *Bj;                       /* Pointer to b_j */
        int    NXj;                      /* Length of x_j */
        int    NBj;                      /* Length of b_j */
        int    NZAj;                     /* Number of nonzeroes in A_j */
        struct Matrix *Aj;               /* Pointer to A_j representation */
        struct Matrix *Rj;               /* Pointer to R_j representation */
        struct Matrix *Pj;               /* Pointer to P_j representation */
        struct Matrix *NIPj;             /* Pointer to NIP_j representation */
        struct Matrix *FASRj;            /* Pointer to R_j^(FAS) representation */
    };

Subroutines are declared through another C routine:

      real CN, CNZ, Cextra
      external rtn
      (set CN, CNZ, and Cextra)
      isubr = mgsubr( mgh, rtn, iparms, fparms, CN, CNZ, Cextra, savscr )

Note that mgsubr saves only the addresses of rtn, iparms, and fparms, not their contents. A
subroutine handle is returned that is coupled to the data handle. The Fortran EXTERNAL
declaration allows subroutine addresses to be passed.

Another routine can be called to set up a LevInfo structure for level j:

      iret = mglevi( mgh, j, isolver, iprecond, imatv, ichlev,
     *               nsolviters, rnorm, mgiter, niiter, xj, bj, nxj, nbj,
     *               nza, mata, matr, matp, matnip, matfas )

Here, isolver, iprecond, imatv, and ichlev are the return values from mgsubr, or 0 if none is wanted.
Also, mata through matfas are return values from mgmat, or 0 if no matrix exists. The arguments xj and bj are the
addresses of the first elements of $x_j$ and $b_j$. These may be indexed as X(ixb) and B(ixb), respectively,
depending on the user's programming style. A nonzero return value means an error occurred.
Finally, the multilevel subroutines can be called:
iret = mgmeth ( mgh, iparm, resid ) 

where iparm is a simplification of the one in Table 2 (it only needs to contain mgalg, startl, levelc, 
levelf, and info, but is extendable). The last argument, resid, is an array where the final residual is 
returned. A nonzero return value means an error occurred. 
To free space, a final call can be made:

      iret = mgdone( mgh )

A nonzero return value means an error occurred. Obviously, this last call is unnecessary if the program
immediately ends.
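The small-integer handles returned by routines such as mgini and released by mgdone can be backed by a fixed table of pointers on the C side. This sketch is hypothetical in every detail: the table size, the MGData contents, and the names sketch_mgini, mg_lookup and sketch_mgdone. It also takes arguments by value. A real Fortran-callable wrapper would receive pointers and follow the local naming convention.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_HANDLES 16            /* hypothetical table size */

/* Hypothetical contents of one data handle. */
struct MGData {
    int levels;
    int dtype;
    void *levinfo;                /* would hold the per-level LevInfo array */
};

static struct MGData *handle_table[MAX_HANDLES];

/* Returns a small positive handle, or 0 on failure (0 means "none"). */
static int sketch_mgini(int levels, int dtype) {
    if (levels <= 0 || dtype == 0 || dtype > 4)
        return 0;
    for (int h = 0; h < MAX_HANDLES; h++) {
        if (handle_table[h] == NULL) {
            struct MGData *p = malloc(sizeof *p);
            if (p == NULL)
                return 0;
            p->levels = levels;
            p->dtype = dtype;
            p->levinfo = NULL;
            handle_table[h] = p;
            return h + 1;
        }
    }
    return 0;                     /* table full */
}

static struct MGData *mg_lookup(int mgh) {
    if (mgh < 1 || mgh > MAX_HANDLES)
        return NULL;
    return handle_table[mgh - 1];
}

/* Frees everything tied to the handle; nonzero return means an error. */
static int sketch_mgdone(int mgh) {
    struct MGData *p = mg_lookup(mgh);
    if (p == NULL)
        return 1;
    free(p->levinfo);
    free(p);
    handle_table[mgh - 1] = NULL;
    assert(mg_lookup(mgh) == NULL);
    return 0;
}
```

Reserving 0 as "no handle" matches the convention used by mglevi, where 0 means that no subroutine or matrix is supplied.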

The advantage of this approach is that subroutines can be written in whatever language makes the
most sense. Further, people who program in C or C++ will not be penalized by having to construct
data structures that only make sense in Fortran.

The worst disadvantage is that to compile the library, some knowledge is needed about how 
the local compiler treats subroutine names. There are three common methods in use and on many 
platforms this can be determined automatically. On a very small number of machines, Fortran and 
C programs cannot be mixed conveniently or at all; these machines will be ignored by this author. 

5. CONCLUSIONS 


In this paper, abstract multilevel methods were reviewed. Two versions of the author’s publicly 
distributed multilevel codes (Madpack) were discussed. From the experience of these codes, a model 
of a better approach using a mixed language approach (C and Fortran+preprocessor) was proposed. 
Implementing such a system, starting from already working solvers (e.g., [8], [3], and [4]), is a
simple exercise for an expert in C and Fortran programming.
SUMMARY


Flame sheet problems are on the natural route to the numerical solution of multidimensional 
flames, which, in turn, are important in many engineering applications. In order to model the flame 
structure more accurately, we use the vorticity-velocity formulation of the fluid flow equations 
instead of the streamfunction-vorticity approach. The numerical solution of the resulting nonlinear 
coupled elliptic partial differential equations involves a pseudo transient process and a steady state 
Newton iteration. Rather than working with dimensionless variables, we introduce scale factors 
that can yield significant savings in the execution time. In this context, we also investigate the 
applicability and performance of several multigrid methods, focusing on nonlinear damped Newton
multigrid, using either one-way or correction schemes.

1. INTRODUCTION 


Recent advances in the development of computational algorithms and supercomputers have 
provided new extremely powerful tools with which to investigate chemically reacting systems that 
were computationally infeasible only a few years ago (see [1], [2], [3], and [4]). The difficulties 
associated with solving high heat release combustion problems stem from the large number of 
dependent unknowns, the nonlinear character of the governing partial differential equations and the 
different length scales present in the problem. Typical combustion problems may involve, in 
addition to the temperature and the fluid dynamics variables, dozens of species defined at each grid 
point and require the resolution of curved fronts whose thickness is on the order of thousandths of 
the domain diameter, across which critical fields vary by orders of magnitude. As a result of the 
fluid dynamics-thermochemistry interaction and its effect on the flame structure, the governing
equations are strongly coupled and are also characterized by the presence of stiff source
terms and nonlinearities. Hence, Newton methods with sophisticated control strategies, including
damping and adaptive continuation techniques, are needed. In spite of these difficulties,
the numerical modeling of multidimensional laminar (or turbulent) flames has recently been
motivated by the growing demand for high fuel efficiency combined with low pollutant emission.
While three-dimensional turbulent flame simulations still remain infeasible on current
supercomputers, axisymmetric laminar diffusion flames constitute a problem of practical
importance, since they are the flame type of several combustion devices. Hence, new robust
numerical models of such systems will provide an efficient tool to probe flame structures and
investigate the coupled effects of complex transport phenomena with chemical kinetics.

This work was supported in part by CERMICS, ENPC, IBM, the Office of Naval Research, and the Department

As part of an ongoing effort to expand combustion modeling capabilities, we investigate 
computationally the performance of several multigrid techniques (see [5], [6], [7], and [8]) combined 
with the numerical solution of combustion related problems. In the present work, we consider a 
flame sheet problem rather than a finite rate chemistry model for an axisymmetric laminar diffusion 
flame in order to alleviate the memory and CPU requirements on the computer simulations. The 
numerical techniques presented in this paper, however, also apply to combustion problems with 
finite rate chemistry [9]. We note that a flame sheet model adds only one field to the hydrodynamic 
fields that describe the underlying flow. A detailed kinetics model adds as many fields as species 
considered in the kinetic mechanism, each with its own coupled conservation equation. Since the 
CPU time and the memory requirements scale with the square of the number of dependent 
unknowns, the flame sheet model considerably reduces the cost of the computer simulations while 
still keeping the coupling and nonlinearity features associated with the original problem. 

In the flame sheet model, the chemical reactions are described with a single one step irreversible 
reaction corresponding to infinitely fast conversion of reactants into stable products. This reaction 
is assumed to be limited to a very thin exothermic reaction zone located at the locus of 
stoichiometric mixing of fuel and oxidizer, where temperature and products of combustion are 
maximized. To further simplify the governing equations, one neglects thermal diffusion effects, 
assumes constant heat capacities and Fick’s law for the ordinary mass diffusion velocities, and takes 
all the Lewis numbers equal to unity [2]. With these approximations, the energy equation and the 
major species equations take on the same mathematical form and by introducing Schvab-Zeldovich 
variables, one can derive a source free convective-diffusive equation for a single conserved scalar. 
Although no information can be recovered about minor or intermediate species in the flame sheet 
limit, the temperature and the stable major species profiles in the system can be obtained from the 
solution of the conserved scalar equation coupled to the flow field equations. Further, the location 
of the physical spatially distributed reaction zone and its temperature distribution can be 
adequately predicted by the flame sheet model for many important fuel-oxidizer combinations and 
configurations. Since flame sheets were studied in [10] as a means of obtaining an approximate solution
to use as an initial iterate for a one-dimensional detailed kinetics computation, they have been
routinely employed to initialize multidimensional diffusion flames.

In §2, a comparison of three possible formulations of the problem is presented, including the 
governing equations and boundary conditions. In §3, the general solution algorithms are presented, 
including a damped Newton method, Jacobian evaluation, linear solvers (Bi-CGSTAB or GMRES), 
and the pseudo transient process. In §4, various multigrid methods are discussed in the context of 
flame sheets. In §5, numerical experiments are presented. Finally, in §6, some conclusions are 
reached. 
2. VORTICITY-VELOCITY FORMULATION


In diffusion flames the combustion process is primarily controlled by the rate at which the fuel 
and oxidizer are brought together in stoichiometric proportions. Thus, independently of the 
submodel used for the chemical kinetics (finite rate vs. flame sheet), the overall accuracy of the 
numerical solution strongly depends on an accurate representation of the flow field. Hence, a brief 
discussion on the various formulations of the Navier-Stokes equations in the context of laminar 
combustion problems is in order.

The first numerical solution of two dimensional axisymmetric laminar diffusion flames was 
obtained using the streamfunction-vorticity formulation [2]. This approach is attractive for three 
reasons: 

1. It eliminates the coupling associated with the presence of the pressure in the momentum 
equations. 

2. It reduces the number of equations to be solved by one. 

3. It also has the important advantage that continuity is explicitly satisfied locally. 

However, the specification of boundary conditions meets with difficulties when one attempts to 
specify vorticity boundary values. In particular, a zero vorticity boundary condition at the inlet of 
the computational domain results in a rough approximation of the true solution, thus severely 
altering the resulting velocity field [3]. On the other hand, the specification of vorticity boundary 
values in terms of the streamfunction requires the discretization of second order derivatives, thus 
yielding off diagonal terms in the Jacobian matrix which result in having to solve severely ill 
conditioned linear systems. Another important difficulty associated with the
streamfunction-vorticity approach is that the extension to three dimensional configurations through
the introduction of a vector potential instead of the scalar streamfunction is cumbersome and 
computationally expensive since it introduces additional dependent variables. 

Alternatively, a primitive form of the Navier-Stokes equations has been recently implemented for 
several axisymmetric laminar diffusion flames (see [3] and [4]). In this approach, the velocity field is 
computed using the momentum equations and the pressure field is recovered from the continuity 
equation. As a result of the difference in nature of the governing equations, the discrete pressure 
field has to be determined in a manner consistent with the discrete continuity equation. This can 
be achieved to machine zero on a staggered grid. However, staggered mesh schemes also have
drawbacks in complex geometry configurations where non-orthogonal curvilinear coordinates are
used, and when sophisticated numerical techniques such as multigrid methods are used (see [11] and
[12]). Although feasible ([13] and [14]), the development of staggered grid based multigrid solvers is 
computationally cumbersome since the transfer operators between levels do not coincide for each 
dependent variable in order to preserve a staggered grid arrangement on all levels. This difficulty 
may even be further exacerbated in three dimensional configurations. Finally, it is worthwhile to 
note that two and three dimensional solutions of incompressible viscous flows on a nonstaggered 
grid have been reported (see [11] and [12]). However, the extension of such procedures to highly 
compressible systems where the density can vary by several orders of magnitude inside the 
computational domain may still yield some complications. 
The vorticity-velocity formulation constitutes a third approach to the numerical solution of the
Navier-Stokes equations. Incompressible fluid flow computations using this formulation are reviewed
in [15]. The vorticity-velocity formulation of the Navier-Stokes equations has
recently been extended to two and three dimensional compressible flows and implemented for the
numerical solution of flame sheet problems (see [16] and [17]). As motivated in these references, a
vorticity-velocity formulation allows replacement of the first order continuity equation with
additional second order equations. Whereas the streamfunction-vorticity formulation also
accomplishes the same replacement in two dimensions, the vorticity-velocity formulation extends to three
dimensions and allows a more accurate formulation of boundary conditions in a numerically compact way.
Furthermore, the off diagonal convective terms in off diagonal blocks that exert a strong influence in a
streamfunction-vorticity formulation disappear. Another attractive feature of the
vorticity-velocity formulation is that the governing equations can be discretized on a nonstaggered
grid, thus allowing the implementation of a multigrid algorithm at relatively low overhead in
additional programming (see [16], [17], and [18]).

The flame sheet governing equations consist of the conservation of total mass, momentum and a 
conserved scalar equation. The conservation of total mass and momentum equations constitute the 
flow field problem and are formulated using the vorticity-velocity formulation of the compressible
axisymmetric Navier-Stokes equations. A source free convective-diffusive equation for a conserved
scalar is solved together with the flow field equations, and the temperature and major
stable species profiles in the system can be recovered from the conserved scalar (see [2], [19], and
references therein). We introduce the velocity vector $v = (v_r, v_z)$, with radial and axial components
$v_r$ and $v_z$, respectively, and the normal component of the vorticity


$$\omega = \frac{\partial v_r}{\partial z} - \frac{\partial v_z}{\partial r}. \qquad (1)$$
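On the tensor-product grids used later, the definition (1) can be evaluated directly from the velocity components. Here is a minimal C sketch using second-order central differences. For simplicity it assumes uniform spacings $h_r$ and $h_z$; the adaptive grids of §3 are generally nonuniform. The array layout and function name are hypothetical.

```c
#include <assert.h>

/* omega = dv_r/dz - dv_z/dr, equation (1), by second-order central
   differences at an interior node of a uniform (r, z) grid. Arrays are
   row-major with i along r (nr nodes) and j along z (nzn nodes). */
static double vorticity_at(const double *vr, const double *vz,
                           int nr, int nzn, double hr, double hz,
                           int i, int j) {
    assert(i > 0 && i < nr - 1 && j > 0 && j < nzn - 1);
    int k = i * nzn + j;
    double dvr_dz = (vr[k + 1] - vr[k - 1]) / (2.0 * hz);
    double dvz_dr = (vz[k + nzn] - vz[k - nzn]) / (2.0 * hr);
    return dvr_dz - dvz_dr;
}
```

Central differences are exact for quadratics. For $v_r = z^2$ and $v_z = r^2$ the sketch therefore reproduces $\omega = 2z - 2r$ up to rounding.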


The vorticity transport equation is formed by taking the curl of the momentum equations, which 
eliminates the partial derivatives of the pressure field. A Laplace equation is obtained for each 
velocity component by taking the gradient of (1) and using the continuity equation. This yields the 
governing equations in the following form: 




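$$
\begin{aligned}
\frac{\partial^2 v_r}{\partial r^2} + \frac{\partial^2 v_r}{\partial z^2} &= \frac{\partial \omega}{\partial z} - \frac{\partial}{\partial r}\left(\frac{v_r}{r}\right) - \frac{\partial}{\partial r}\left(\frac{v \cdot \nabla\rho}{\rho}\right),\\
\frac{\partial^2 v_z}{\partial r^2} + \frac{\partial^2 v_z}{\partial z^2} &= -\frac{\partial \omega}{\partial r} - \frac{1}{r}\frac{\partial v_r}{\partial z} - \frac{\partial}{\partial z}\left(\frac{v \cdot \nabla\rho}{\rho}\right),\\
\text{[vorticity transport terms, illegible in source]} &= 2\left(\nabla(\operatorname{div}(v)) \cdot \nabla\mu - \nabla v_r \cdot \nabla\frac{\partial\mu}{\partial r} - \nabla v_z \cdot \nabla\frac{\partial\mu}{\partial z}\right),\\
\frac{1}{r}\frac{\partial}{\partial r}\left(r\rho D\frac{\partial S}{\partial r}\right) + \frac{\partial}{\partial z}\left(\rho D\frac{\partial S}{\partial z}\right) &= \rho v_r\frac{\partial S}{\partial r} + \rho v_z\frac{\partial S}{\partial z},
\end{aligned}
\qquad (2)
$$

where $\rho$ is the density, $\mu$ the viscosity, $g$ the gravity vector, $\operatorname{div}(v)$ the cylindrical divergence of the
velocity vector, $S$ the conserved scalar, $D$ a diffusion coefficient, and the components of $\nabla\beta$ are
$(\partial\beta/\partial r, \partial\beta/\partial z)$. The density is computed using the perfect gas law and, in the low Mach number
approximation valid for these flame configurations, one can use the outlet (constant) pressure.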
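Table 1: Boundary conditions

Boundary                       v_r                              v_z                              omega                            S
Axis of symmetry (r = 0)       $v_r = 0$                        $\partial v_z/\partial r = 0$    $\omega = 0$                     $\partial S/\partial r = 0$
Outer zone (r = R_max)         [illegible]                      $\partial v_z/\partial r = 0$    $\omega = \partial v_r/\partial z$    $S = 0$
Inlet (z = 0)                  $v_r = 0$                        $v_z = v_z^0(r)$                 [illegible]                      $S = S^0(r)$
Exit (z = L)                   $v_r = 0$                        $\partial v_z/\partial z = 0$    [illegible]                      $\partial S/\partial z = 0$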


Consequently, in the above formulation, the pressure field is eliminated from the governing 
equations as a dependent unknown and can be recovered, once a computed numerical solution of 
(2) is obtained, by solving a Laplace type equation derived by taking the divergence of the 
momentum equations [15]. 

Recalling that all of the Lewis numbers are taken equal to unity, the quantity $\rho D$ is given by the
viscosity coefficient $\mu$ divided by a reference Prandtl number, for which we use the approximate value
$Pr = 0.75$. Hence, in this model, the determination of all the transport coefficients is reduced
to the specification of a transport relation for the viscosity, and we use the same power law as the
one given in [2]. We also note that, due to the high temperature gradients present in the system,
the viscosity derivatives in the right-hand side of the vorticity transport equation (2) cannot be
neglected. Our numerical experiments show that such an approximation leads to significant
differences in the numerical solution, especially in the radial velocity profile. Finally, a conservative
form of the convective terms can also be considered, but it yields slower convergence rates without
any significant change in the computed solution.

A schematic of the physical configuration is given in Figure 1. It consists of an inner cylindrical fuel jet (radius R_j = 0.2 cm), an outer co-flowing annular oxidizer jet (radius R_o = 2.5 cm), and a dead zone extending to R_max = 7.5 cm. The inlet velocity profiles of the fuel and the oxidizer are both plug flows of 35 cm/s. This yields a typical value for the Reynolds number of 550. Further, the flame length is approximately 3 cm [19], and the length of the computational domain is set to L = 30 cm. Although the fuel and oxidizer reservoirs are at room temperature (300 K), we need to assume, in the flame sheet model, that the temperature already reaches the peak temperature value along the inlet boundary at r = R_j. This peak temperature is estimated to be 2050 K for a methane-air configuration. Hence, the inlet profile of the conserved scalar is specified in such a way that the resulting temperature distribution blends the room temperature reservoirs and the peak temperature by means of a narrow Gaussian centered at r = R_j. The narrowness of the Gaussian profile has a significant influence on the calculated flame length, so its parameters have to be determined appropriately [19]. The boundary conditions are summarized in Table 1. Finally, we note that using the definition of the vorticity (1) as the vorticity outlet boundary condition does not yield any relevant changes in the computed solution.

3. GENERAL SOLUTION ALGORITHM 

The partial differential equations (2) together with the boundary conditions (see Table 1) are discretized on a two dimensional tensor product grid. A solution is first obtained on an initial coarse grid. Additional mesh points are then adaptively inserted in regions of high physical activity by equidistributing weight functions of the local gradient and curvature of the numerical solution [2], which yields a 129 x 161 grid. To verify the grid independence of the solution, we refined this grid to 257 x 219 points. The relative error between the two solutions was found to be lower than 2%, and differences were only encountered in the outflow region, where the grids were still somewhat coarse. However, the flame length and the temperature distribution inside the flame were accurately predicted on the 129 x 161 grid. Hence, this grid will be considered as the finest grid in what follows.

Figure 1: Physical configuration (not to scale)

The differential operators in the partial differential equations (2) are approximated with finite difference expressions. Diffusion and source terms are evaluated using centered differences. We adopt a monotonicity preserving upwind scheme for the convective terms (see [20, p. 304]). For instance,

$$\left(v_r\frac{\partial S}{\partial r}\right)_i \approx \max\{(v_r)_i, 0\}\,\frac{S_i - S_{i-1}}{r_i - r_{i-1}} - \max\{-(v_r)_i, 0\}\,\frac{S_{i+1} - S_i}{r_{i+1} - r_i}. \qquad (3)$$
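The upwind formula (3) takes the one-sided difference from the side the flow comes from. A minimal NumPy sketch for the interior nodes of a nonuniform radial grid, assuming nodal values of v_r and S on a single grid line (the function name is hypothetical, not from the paper's code):

```python
import numpy as np

def upwind_convection(v, S, r):
    """Upwind approximation of v * dS/dr at the interior nodes of a
    nonuniform 1-D grid r, in the form of scheme (3):
        max(v_i, 0) * (S_i - S_{i-1}) / (r_i - r_{i-1})
      - max(-v_i, 0) * (S_{i+1} - S_i) / (r_{i+1} - r_i)
    Returns an array of length len(r) - 2."""
    v, S, r = (np.asarray(a, dtype=float) for a in (v, S, r))
    vi = v[1:-1]
    backward = (S[1:-1] - S[:-2]) / (r[1:-1] - r[:-2])  # used where v_i > 0
    forward = (S[2:] - S[1:-1]) / (r[2:] - r[1:-1])     # used where v_i < 0
    return np.maximum(vi, 0.0) * backward - np.maximum(-vi, 0.0) * forward

r = np.array([0.0, 1.0, 3.0, 4.0])
v = np.array([2.0, 2.0, -3.0, 1.0])
print(upwind_convection(v, r**2, r))  # [2. -21.]
```

At the node with v_r > 0 the backward slope of S = r² is 1, while at the node with v_r < 0 the forward slope is 7, so the two entries are 2·1 and −3·7.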

The boundary conditions given in Table 1 involve only zero or first order derivatives. For the latter terms, first order backward or forward differences can be used, except for two boundary conditions which require a more accurate treatment. First, as motivated in [17], the vorticity inlet boundary condition is discretized using the vorticity values at the first two lines of the computational domain. More specifically, at an inlet point (i, 1), we discretize the equation ω = ∂v_r/∂z − ∂v_z/∂r as follows:


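$$\frac{1}{2}\left(\omega_{i,1} + \omega_{i,2}\right) = \frac{(v_r)_{i,2}}{z_2 - z_1} - \frac{(v_z)_{i+1,1} - (v_z)_{i-1,1}}{r_{i+1} - r_{i-1}}, \qquad (4)$$

where we have used (v_r)_{i,1} = 0 on the inlet.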
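In a fully coupled Newton solver, relation (4) enters as one residual per inlet node. A minimal sketch of that residual, assuming scalar nodal values passed in directly (the function and argument names are hypothetical):

```python
def inlet_vorticity_residual(w1, w2, vr2, vz_right, vz_left, z1, z2, r_right, r_left):
    """Residual of the inlet relation (4) at the inlet node (i, 1).

    w1, w2            : vorticity at (i, 1) and (i, 2)
    vr2               : radial velocity at (i, 2); (v_r)_{i,1} = 0 on the inlet
    vz_right, vz_left : axial velocity at (i+1, 1) and (i-1, 1)
    """
    dvr_dz = vr2 / (z2 - z1)                              # one-sided, uses v_r = 0 at z1
    dvz_dr = (vz_right - vz_left) / (r_right - r_left)    # centered along the inlet line
    return 0.5 * (w1 + w2) - (dvr_dz - dvz_dr)

print(inlet_vorticity_residual(0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 2.0, 0.0))  # -1.0
```

Note that both the average of ω over the first two lines and the one-sided difference of v_r are centered at z = (z_1 + z_2)/2.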


It is also of critical importance for the accuracy of the numerical solution that the axial velocity boundary condition on the axis of symmetry be evaluated using a second order scheme. At any point on the axis of symmetry, we have

$$(v_z)_2 - (v_z)_1 = \frac{(r_2 - r_1)^2}{2}\,\frac{\partial^2 v_z}{\partial r^2} + O\!\left((r_2 - r_1)^3\right),$$

since ∂v_z/∂r = 0 on the axis. The right hand side is evaluated using the Laplace equation for v_z in (2). On the axis of symmetry, this reduces to

$$\frac{\partial^2 v_z}{\partial r^2} = -\frac{\partial^2 v_z}{\partial z^2} - \frac{\partial \omega}{\partial r} - \frac{\partial}{\partial z}\left(\frac{v_z}{\rho}\frac{\partial \rho}{\partial z}\right).$$

The radial derivative of the vorticity can be discretized with a first order difference while still 
yielding an overall second order accuracy for v z . By comparing our numerical solutions with a 
primitive variable solution of the same problem [19], we found that these two boundary conditions 
exerted a strong influence on the overall accuracy of the numerical solution. 
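The benefit of the Taylor-based axis condition over the first-order choice (v_z)_1 = (v_z)_2 can be seen on a smooth profile. A small sketch, assuming the hypothetical even profile v_z(r) = cos r with r_1 = 0 and the exact value v_z''(0) = −1 (in the solver this value comes from the reduced equation above):

```python
import math

def axis_vz_from_second(vz2, r1, r2, d2vz_dr2):
    """Axis value (v_z)_1 implied by
    (v_z)_2 - (v_z)_1 = (r2 - r1)**2 / 2 * d2v_z/dr2, valid when dv_z/dr = 0 at r1."""
    return vz2 - 0.5 * (r2 - r1) ** 2 * d2vz_dr2

def axis_error(h):
    """Error of the axis value for the hypothetical profile v_z(r) = cos(r)."""
    return abs(axis_vz_from_second(math.cos(h), 0.0, h, -1.0) - 1.0)

ratio = axis_error(0.1) / axis_error(0.05)  # close to 16 for this even profile
```

For this even profile the error behaves like h^4/24 (about 4e-6 at h = 0.1), whereas setting (v_z)_1 = (v_z)_2 would leave an error of about h²/2 = 5e-3.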

The discretization of the partial differential equations (2) together with the boundary conditions (Table 1) yields a set of algebraic equations of the form F(U) = 0, which is solved using a damped Newton method

$$J(U^n)\,\Delta U^n = -\lambda^n F(U^n), \qquad n = 0, 1, \ldots, \qquad (5)$$

where λ^n is the damping parameter, with convergence tolerance ‖ΔU^n‖ < 10^{-5}. The Jacobian matrix J(U^n) is computed numerically using vector function evaluations, and the grid nodes are split into nine independent groups which are perturbed simultaneously (see [2] for more details). Selected cases were rerun with a more stringent convergence tolerance of 10^{-6}, without any significant changes in the numerical solution. Rather than working with dimensionless variables, we introduce a scale factor α_l, l ∈ [1, n_c], for each dependent variable (n_c = 4 for the flame sheet problem). The norm of the discrete vector ΔU^n is then given by

$$\|\Delta U^n\| = \left(\sum_{l\in[1,n_c]}\sum_{p}\left(\frac{\Delta U^n_{l,p}}{\alpha_l}\right)^2\right)^{1/2}, \qquad (6)$$

where p runs over the grid points.

It is worthwhile to point out that an appropriate choice of the scale factors can yield significant 
savings in the execution time. This point will be further illustrated with numerical experiments in 
§5.1. 
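The nine-group perturbation used to build J(U^n) is a form of Jacobian coloring: nodes whose stencils do not overlap can be perturbed in the same function evaluation. A minimal sketch, assuming one scalar unknown per node, a 9-point stencil, and a hypothetical model residual (not the flame sheet equations). Nodes sharing the colour (i mod 3, j mod 3) are at least three apart in i or j, so their perturbations change disjoint residual entries:

```python
import numpy as np

def residual(U):
    """Hypothetical residual with a 9-point stencil: sum of U over the 3x3
    neighbourhood of each node (zero outside the grid) plus U**3."""
    P = np.pad(U, 1)
    n, m = U.shape
    S = sum(P[a:a + n, b:b + m] for a in range(3) for b in range(3))
    return S + U**3

def colored_fd_jacobian(F, U, eps=1e-7):
    """Forward-difference Jacobian using 9 perturbation groups (colours
    (i % 3, j % 3)), i.e. 9 extra residual evaluations instead of n*m."""
    n, m = U.shape
    F0 = F(U)
    Jac = np.zeros((n * m, n * m))
    I, Jg = np.meshgrid(np.arange(n), np.arange(m), indexing="ij")
    for ci in range(3):
        for cj in range(3):
            mask = (I % 3 == ci) & (Jg % 3 == cj)
            dF = (F(U + eps * mask) - F0) / eps       # one evaluation per group
            for i, j in zip(*np.nonzero(mask)):
                # residual entries touched by node (i, j): its 3x3 neighbourhood
                for a in range(max(i - 1, 0), min(i + 2, n)):
                    for b in range(max(j - 1, 0), min(j + 2, m)):
                        Jac[a * m + b, i * m + j] = dF[a, b]
    return Jac
```

With several unknowns per node, as in the flame sheet system, each group would presumably be perturbed once per component, which is an assumption about the details given in [2].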

The linear system (5) is solved at each Newton step through an inner iteration. This inner iteration may consist of either the Bi-CGSTAB algorithm [21] or a restarted version of GMRES [22], combined with a Gauss-Seidel (GS) left preconditioner. This choice is motivated in [16] through various numerical simulations of flame sheet problems. Although a single Bi-CGSTAB/GS iteration requires approximately 1.5 times more time than an average GMRES/GS iteration, both algorithms yield total execution times which are in general within a few percent of each other. The former has lower memory requirements (see the end of §5.2 for more details). The convergence of the inner iteration is based on the norm of the left preconditioned linear residual, using an absolute tolerance equal to one-tenth of the Newton tolerance. Such a termination criterion returns enough information on the update vector ΔU^n to the Newton iteration (see [16] for more details).
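A dense NumPy sketch of Bi-CGSTAB applied to the left Gauss-Seidel preconditioned system M^{-1}J x = M^{-1}b, with M the lower triangle of the matrix and the stopping test on the preconditioned residual, as described above. The nonsymmetric tridiagonal matrix below is a hypothetical stand-in for J(U^n). A production code would use sparse storage and a forward sweep instead of `np.linalg.solve`:

```python
import numpy as np

def bicgstab_gs_left(A, b, tol=1e-10, maxiter=200):
    """Bi-CGSTAB (van der Vorst) on M^{-1} A x = M^{-1} b, where M = D + L is
    the Gauss-Seidel matrix (lower triangle of A). Stops when the norm of the
    preconditioned residual drops below tol. Returns (x, iterations)."""
    A = np.asarray(A, dtype=float)
    M = np.tril(A)

    def op(y):                                   # y -> M^{-1} A y
        return np.linalg.solve(M, A @ y)

    x = np.zeros(len(b))
    r = np.linalg.solve(M, np.asarray(b, dtype=float))   # preconditioned residual, x0 = 0
    r_hat = r.copy()
    rho_old = alpha = omega = 1.0
    v = np.zeros_like(r)
    p = np.zeros_like(r)
    for it in range(maxiter):
        if np.linalg.norm(r) < tol:
            return x, it
        rho = r_hat @ r
        beta = (rho / rho_old) * (alpha / omega)
        p = r + beta * (p - omega * v)
        v = op(p)
        alpha = rho / (r_hat @ v)
        s = r - alpha * v
        if np.linalg.norm(s) < tol:              # early exit avoids 0/0 in omega
            return x + alpha * p, it + 1
        t = op(s)
        omega = (t @ s) / (t @ t)
        x = x + alpha * p + omega * s
        r = s - omega * t
        rho_old = rho
    return x, maxiter

n = 30
A = 2.5 * np.eye(n) - 1.2 * np.eye(n, k=-1) - 0.8 * np.eye(n, k=1)
x, its = bicgstab_gs_left(A, np.ones(n))
```

Each Bi-CGSTAB iteration applies the preconditioned operator twice (for v and t), whereas GMRES applies it once per iteration, which is consistent with a Bi-CGSTAB/GS step costing more than a GMRES/GS step.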

Due to the nonlinearity of the original problem, a pseudo transient process is used to produce a problem that is parabolic in time and to bring the starting estimate into the convergence domain of the steady Newton method. The original nonlinear elliptic problem is cast into a parabolic form by appending a pseudo transient term ∂U/∂t to the original set of algebraic equations F(U) = 0, and a fully implicit scheme solves (again with Newton's method)

$$\tilde{F}(U^{n+1}) = F(U^{n+1}) + \frac{U^{n+1} - U^n}{\Delta t^{n+1}} = 0, \qquad (7)$$

where Δt^{n+1} is the (n + 1)st time step. The number of time steps needed to bring the initial guessed solution into the convergence domain of the steady Newton iteration depends on the size of the grid: the coarser the grid, the fewer relaxation steps are necessary. This point will be further discussed in §5.2.
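A small NumPy sketch of the pseudo transient strategy (7) followed by the steady damped Newton iteration (5). Assumptions not taken from the text: the toy residual F(U) = arctan(U − c), for which plain Newton diverges from distant starting guesses; a crude divergence guard; and the rule of halving Δt after a failed step and doubling it after a successful one:

```python
import numpy as np

def newton(F, J, U0, tol=1e-10, maxit=50, lam=1.0):
    """Damped Newton: solve J(U) dU = -lam F(U), set U <- U + dU.
    Returns (U, converged); gives up if the iterates blow up."""
    U = np.array(U0, dtype=float)
    for _ in range(maxit):
        dU = np.linalg.solve(J(U), -lam * F(U))
        U = U + dU
        if not np.all(np.isfinite(U)) or np.linalg.norm(U) > 1e6:
            return U, False                    # crude guard for this toy problem
        if np.linalg.norm(dU) < tol:
            return U, True
    return U, False

def pseudo_transient(F, J, U0, dt=0.1, grow=2.0, nsteps=40):
    """Backward-Euler steps of dU/dt + F(U) = 0: each step solves
    F(V) + (V - U^n)/dt = 0 with Newton, as in (7)."""
    U = np.array(U0, dtype=float)
    I = np.eye(U.size)
    for _ in range(nsteps):
        G = lambda V: F(V) + (V - U) / dt
        JG = lambda V: J(V) + I / dt
        V, ok = newton(G, JG, U)
        if ok:
            U, dt = V, dt * grow               # accept the step, enlarge dt
        else:
            dt *= 0.5                          # retry with a smaller step
    return U

c = np.array([0.5, -1.0])
F = lambda U: np.arctan(U - c)
J = lambda U: np.diag(1.0 / (1.0 + (U - c) ** 2))
U0 = c + 10.0
_, ok_direct = newton(F, J, U0)          # plain Newton from U0 fails
U1 = pseudo_transient(F, J, U0)          # time stepping brings U0 close to the root
U_star, ok = newton(F, J, U1)            # the steady Newton iteration then converges
```

The added 1/Δt term makes each time step's Jacobian better conditioned than J alone, which is why small initial steps succeed where the steady iteration does not.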


4. MULTIGRID TECHNIQUES 

The multigrid philosophy applied to our model problem is derived from [5], [7], and [8]. We assume that there is a sequence of spaces M_i, i = 1, …, k, where the M_i approximate M_1. We further suppose there exist restriction and prolongation mappings

$$R_i : M_i \to M_{i+1}, \quad 1 \le i \le k-1, \qquad P_i : M_i \to M_{i-1}, \quad 2 \le i \le k,$$

between neighboring spaces. We also assume there is a sequence of problems (5) represented by J_i. A multilevel correction algorithm, where the finest level is level 1 and the coarsest level is level k, is simply defined by

Algorithm MGC(lev, {J_j, x_j, b_j}_{j=1}^{k}, {P_j}_{j=2}^{k}, {R_j}_{j=1}^{k-1})

1. x_lev ← Solver_lev(J_lev, x_lev, b_lev)

2. If lev < k, then repeat 2a-2d until some condition is met:

2a. x_{lev+1} ← 0, b_{lev+1} ← R_lev(b_lev − J_lev x_lev)

2b. MGC(lev + 1, {J_j, x_j, b_j}_{j=1}^{k}, {P_j}_{j=2}^{k}, {R_j}_{j=1}^{k-1})

2c. x_lev ← x_lev + P_{lev+1} x_{lev+1}

2d. x_lev ← Solver_lev(J_lev, x_lev, b_lev)

In our case, the solver on every level is either Bi-CGSTAB/GS or GMRES/GS. In step 1 on level k, our stopping criterion was that the linear residual was adequately reduced (see §3). On the other levels, the stopping criterion was either an upper limit on the number of iterations or an adequate reduction of the linear residual.

A common condition in step 2 is to do steps 2a-2d some specified number of times (e.g., 0 for 
one way multigrid, 1 for a V Cycle, or 2 for a W Cycle). In §5.2, a V Cycle took less overall time 
than any other choice for a condition in step 2. However, many V Cycles were necessary, starting 
from the finest level (see the definition of Algorithm NIC below). 

Brandt’s FAS algorithm [6] is a nonlinear variant of Algorithm MGC. A nonlinear smoother is 
used in steps 1 and 2d, the actual solution is computed on every level, and corrections are 
computed before interpolation in step 2c (see [23] for more details). 

We use a nested iteration multilevel algorithm since no adequate initial guess to the solution is available at the outset.

Algorithm NIC(lev, {J_j, x_j, b_j}_{j=1}^{k}, {P_j}_{j=2}^{k}, {R_j}_{j=1}^{k-1})

1. MGC(k, {J_j, x_j, b_j}_{j=1}^{k}, {P_j}_{j=2}^{k}, {R_j}_{j=1}^{k-1})

2. Do steps 2a-2b with lev = k − 1, …, 1:

2a. x_lev ← P_{lev+1} x_{lev+1}

2b. MGC(lev, {J_j, x_j, b_j}_{j=1}^{k}, {P_j}_{j=2}^{k}, {R_j}_{j=1}^{k-1})
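A compact NumPy sketch of Algorithms MGC and NIC on a linear model problem. Several assumptions are not taken from the text. The 1-D Poisson matrix stands in for the Jacobians J_j. A few forward Gauss-Seidel sweeps stand in for Bi-CGSTAB/GS or GMRES/GS, and the coarsest level is solved directly. Levels are numbered from 0 (finest) in Python, and all helper names are hypothetical. The `cycles` argument is the condition of step 2: 0, 1 or 2 repetitions, i.e. one way, V-cycle or W-cycle behaviour:

```python
import numpy as np

def poisson_1d(n):
    """-u'' on n interior points of (0, 1), h = 1/(n + 1)."""
    h = 1.0 / (n + 1)
    return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

def prolongation(nc):
    """Linear interpolation from nc coarse to 2*nc + 1 fine interior points."""
    P = np.zeros((2 * nc + 1, nc))
    for j in range(nc):
        P[2 * j, j] += 0.5
        P[2 * j + 1, j] = 1.0
        P[2 * j + 2, j] += 0.5
    return P

def gauss_seidel(A, x, b, sweeps):
    """Forward Gauss-Seidel sweeps (stand-in for the Krylov/GS smoothers)."""
    x = x.copy()
    for _ in range(sweeps):
        for i in range(len(b)):
            x[i] += (b[i] - A[i] @ x) / A[i, i]
    return x

def mgc(lev, J, x, b, P, R, cycles=1, sweeps=3):
    """Algorithm MGC; levels 0 (finest) .. len(J) - 1 (coarsest)."""
    k = len(J) - 1
    if lev == k:
        return np.linalg.solve(J[k], b)              # step 1 on the coarsest level
    x = gauss_seidel(J[lev], x, b, sweeps)          # step 1
    for _ in range(cycles):                         # step 2
        bc = R[lev] @ (b - J[lev] @ x)              # 2a: restrict the residual
        xc = mgc(lev + 1, J, np.zeros(len(bc)), bc, P, R, cycles, sweeps)  # 2b
        x = x + P[lev + 1] @ xc                     # 2c: coarse grid correction
        x = gauss_seidel(J[lev], x, b, sweeps)      # 2d: smooth again
    return x

def nic(J, b, P, R, cycles=1, sweeps=3):
    """Algorithm NIC: solve on the coarsest level, then interpolate to each
    finer level and improve the interpolant with MGC."""
    k = len(J) - 1
    x = mgc(k, J, np.zeros(len(b[k])), b[k], P, R, cycles, sweeps)   # step 1
    for lev in range(k - 1, -1, -1):                # step 2, coarse to fine
        x = P[lev + 1] @ x                          # 2a
        x = mgc(lev, J, x, b[lev], P, R, cycles, sweeps)              # 2b
    return x

sizes = [63, 31, 15, 7]                             # finest level first
J = [poisson_1d(n) for n in sizes]
P = [None] + [prolongation(n) for n in sizes[1:]]   # P[i]: level i -> level i - 1
R = [0.5 * prolongation(n).T for n in sizes[1:]] + [None]  # full weighting
b = [np.ones(n) for n in sizes]                     # f = 1 discretized on each level
x = nic(J, b, P, R)
```

In the damped Newton variant described next, each call to MGC in NIC would be preceded by forming a Jacobian and taking a damped Newton step, which this linear sketch omits.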





A damped Newton multilevel algorithm is defined by introducing an additional step before each reference to Algorithm MGC in Algorithm NIC only. Before each reference to Algorithm MGC, a
Jacobian is formed and a damped Newton step is performed. The last Jacobian on a level is saved 
for use in multilevel correction steps. A one way multilevel algorithm means that Algorithm MGC 
never performs any portion of its step 2 as part of its use by Algorithm NIC. We always use a 
damped Newton iteration, but we drop the term damped Newton when referring to one way 
multilevel methods. 

The difference between FAS and damped Newton multilevel methods is easy to characterize. FAS
uses a nonlinear iterative method (e.g., nonlinear Gauss-Seidel) while damped Newton uses 
standard linear solvers. When evaluating the nonlinear function is inexpensive, FAS usually 
produces an approximate solution faster than the damped Newton multilevel method. However, 
when the function evaluations are expensive, the damped Newton multilevel method usually 
produces an approximate solution faster than FAS. In a typical diffusion flame problem with finite 
rate chemistry [9], the function evaluations are horrendously expensive, so we did not explore FAS. 
For a flame sheet problem solved using FAS, see [24]. 

5. NUMERICAL RESULTS 


In this section, we present several numerical results obtained on an IBM RISC System/6000 (model 560). In §5.1, we focus on unigrid calculations and emphasize the importance of the scale factors α_l in (6) for appropriately monitoring the convergence of the outer damped Newton iteration. Our numerical experiments show that the overall execution times can be decreased by up to an order of magnitude by taking a large scale factor for all of the vorticity corrections in the computational domain. The execution times can be decreased by additional factors of six and ten by combining the unigrid numerical procedure with damped Newton multilevel iterations, using one way and correction schemes, respectively. The corresponding numerical results are presented in §5.2.


5.1. Unigrid tests 


In this section, we discuss the influence of the scale factors α_l in (6) on the whole convergence history of the numerical solution. By modifying these scale factors, we shift the balance of work required in the outer Newton iteration and in the inner linear iterations between the different degrees of freedom present in the system. In particular, a large scale factor for the vorticity component asks for less accuracy in the computed vorticity corrections that are brought back to the Newton iteration, thus considerably reducing the amount of work at each Newton step. As indicated by our numerical experiments, this does not yield any loss of accuracy for the other components of the numerical solution (the radial and axial velocities and the conserved scalar). Another important consequence is that much larger time steps can be taken, even at the beginning of the pseudo transient process when the solution is approximated with a very “coarse” initial guess. Furthermore, only a few time steps are required (typically 20) before the numerical solution already lies in the convergence domain of the steady Newton iteration (5). With lower scale factors for the vorticity, most of the CPU time is spent during the pseudo transient iterations, since much
smaller time steps need to be taken and the convergence domain of the iteration scheme (5)
becomes much narrower. Our numerical experiments indicate that a scale factor for the vorticity of 
10 1 2 3 can yield savings in CPU time of up to an order of magnitude without altering the velocity and 
temperature profiles of the numerical solution. 
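To make the effect of a large vorticity scale factor concrete, here is a small sketch. It assumes that the norm (6) is a plain 2-norm of the components divided by α_l, and it uses a hypothetical correction at a single grid point with column order (v_r, v_z, ω, S):

```python
import numpy as np

def scaled_norm(dU, alpha):
    """2-norm of a Newton correction with variable l divided by its scale
    factor alpha_l; dU has shape (n_points, n_c)."""
    dU = np.asarray(dU, dtype=float)
    return float(np.sqrt(np.sum((dU / np.asarray(alpha, dtype=float)) ** 2)))

dU = np.array([[1e-6, 1e-6, 1e-3, 1e-6]])       # hypothetical correction
tol = 1e-5
unit = scaled_norm(dU, [1.0, 1.0, 1.0, 1.0])    # about 1e-3: not converged
large = scaled_norm(dU, [1.0, 1.0, 1e3, 1.0])   # 2e-6: below the tolerance
```

With unit scale factors the vorticity entry alone keeps the iteration going. With α_ω = 10^3 the same correction passes the test, so the Newton and inner iterations are not forced to resolve the vorticity corrections to the same absolute accuracy as the other variables.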

5.2. Multigrid acceleration


In this section, we present further improvements in the total execution times obtained by 
combining the numerical procedure described in §3 and §5.1 with damped Newton multilevel 
iterations, using either one way or correction schemes. In all of the results, the speedups represent 
ratios of CPU times. 

We consider the finest level to be a 129 x 161 grid and we construct three additional coarser 
grids by successively discarding every other node from one grid to the coarser one. This yields a 
coarsest grid of 17 x 21 points. It is worthwhile to note that the use of even coarser grids in these 
problems meets with difficulties since the calculated flame speeds become excessively large due to 
the influence of numerical diffusion and/or conduction (see [25]) and the Newton iteration (5) fails 
to converge. 
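Discarding every other node of a grid line with an odd number n of points leaves (n + 1)/2 points. A short check of the grid sizes quoted above:

```python
def coarsen(nx, nz, levels):
    """Grid sizes obtained by discarding every other node, (n + 1) // 2 per
    direction, starting from an nx-by-nz grid with odd point counts."""
    sizes = [(nx, nz)]
    for _ in range(levels - 1):
        nx, nz = (nx + 1) // 2, (nz + 1) // 2
        sizes.append((nx, nz))
    return sizes

print(coarsen(129, 161, 4))  # [(129, 161), (65, 81), (33, 41), (17, 21)]
```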

In the one way nonlinear multigrid approach, we solve the nonlinear problem F(U) = 0 in one cycle, starting at the coarsest level and ending at the finest. Asymptotically, as the mesh spacing approaches zero, the interpolant of the computed solution on one grid lies in the convergence domain of Newton's method on the next finer grid [26]. In our numerical calculations, this was found
to be the case for all levels considered, when using either cubic or linear interpolation between 
levels. As a consequence, the pseudo transient process needs only to be performed on the coarsest 
level, in order to bring the initial guess into the convergence domain of the steady Newton iteration 
on this level. This procedure is particularly attractive for two reasons: 

1. By time stepping on the coarsest level, we reduce considerably the amount of work spent in 
the pseudo transient phase. 

2. On coarser grids, less computer time is needed to solve (5). 

The first set of numerical experiments was performed using Bi-CGSTAB/GS as the linear
smoother. The numerical results obtained during the pseudo transient phase are presented in
Table 2. On our workstation, the time stepping requires 15 seconds on the coarsest level as opposed 
to over 40 minutes on the finest, thus yielding a speedup of 166. Table 3 breaks down the numerical 
results for the steady state Newton iterations. Note that the CPU time spent during the pseudo 
transient process has been included in the computation of the speedups presented in Table 3. A 
speedup of a factor of four is achieved using the one way nonlinear multigrid on two levels, which is 
due to the significant decrease of smoothing steps done on the finest level. With three and four 
levels, we obtained speedups of 5.4 and 5.8, respectively. The four level multigrid improves only 
marginally the execution times, since it decreases the CPU time spent on the third level, while most 
of the work is already concentrated in the smoothing iterations on the finest level. Finally, it is 
interesting to note that linear interpolation between levels yields lower execution times than cubic 
interpolation when Bi-CGSTAB/GS is used as the linear smoother. 

We also implemented the one way nonlinear multigrid algorithm using GMRES/GS as the linear 
smoother with 25 Krylov vectors. This requires 15 Mb of additional storage for the Krylov space. 




Table 2: Numerical results for one way nonlinear multigrid during the pseudo transient phase with Bi-CGSTAB/GS as the linear smoother

                             Levels
Operation                    1        2        3        4
Bi-CGSTAB/GS iterations      634      352      217      160
Speedup in time              1.0      6.6      34.6     166.0


Table 3: Numerical results for one way nonlinear multigrid

                             Levels
Operation                    1        2        3        4
smooth(1)                    1632     371      384      378
smooth(2)                    -        723      390      380
smooth(3)                    -        -        326      346
smooth(4)                    -        -        -        192
Speedup in time              1.0      ≈4       5.4      5.8

Smooth(i) represents the total number of Bi-CGSTAB/GS steps done on level i during the steady state Newton iterations.


We found in our numerical experiments that the use of cubic interpolation between levels yielded 
lower execution times than linear interpolation and that it was more efficient to adaptively increase 
the time step slightly faster during the pseudo transient phase with respect to the Bi-CGSTAB/GS 
calculations. The numerical results are given in Tables 4 and 5. We obtain a speedup of 160 for the 
pseudo transient phase on four levels. As indicated in Table 5, the total execution times delivered 
are greater than the ones obtained with Bi-CGSTAB/GS. This latter algorithm seems therefore to 
be a preferable linear smoother when using one way nonlinear multigrid. Note also that the unigrid 
calculation fails to converge since GMRES/GS stagnates. 

Table 4: Numerical results for one way nonlinear multigrid during the pseudo transient phase with GMRES/GS as the linear smoother

                             Levels
Operation                    2        3        4
GMRES/GS iterations          572      367      258
Speedup in time              7.2      34.6     159.6

Table 5: Numerical results for one way nonlinear multigrid

                             Levels
Operation                    2        3        4
smooth(1)                    530      945      945
smooth(2)                    1559     592      590
smooth(3)                    -        481      825
smooth(4)                    -        -        161
Speedup in time              3.2      4.2      4.2

Smooth(i) represents the total number of GMRES/GS steps done on level i during the steady state Newton iterations. The speedups are with respect to the unigrid solution time in Table 3.

Table 6: Numerical results for damped Newton multilevel iterations

                             Levels
Operation                    1        2        3        4
smooth(1)                    1632     238      268      243
smooth(2)                    -        1096     645      673
smooth(3)                    -        -        861      1243
smooth(4)                    -        -        -        799
Speedup in time              1.0      4.8      6.2      6.6

Smooth(i) represents the total number of Bi-CGSTAB/GS steps done on level i during the steady state Newton iterations.

In order to solve the linear systems more efficiently, especially the one on the finest level, we perform damped Newton multilevel iterations, making use of the Jacobians computed on all levels coarser than the current one (see Algorithm MGC in §4 for more details). The numerical results presented in Table 6 are obtained using 30 steps of Bi-CGSTAB/GS as the linear smoother, which may seem at first glance to be an excessive number of iterations. We obtain a speedup of 6.6 when using four levels. A comparison of Tables 3 and 6 shows that the balance of smoothing iterations is shifted towards the coarsest levels when using damped Newton multilevel iterations, thus yielding lower execution times (by approximately 12%) than the ones obtained with the one way nonlinear multigrid. However, it is worthwhile to point out that this improvement comes at the expense of storage, since the one way nonlinear multigrid requires 39 Mb and the damped Newton multilevel iterations require up to 62 Mb. This difference is due mainly to the fact that damped Newton multilevel correction methods require saving a Jacobian on every level instead of just one.
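As a quick check on these figures: both four-level speedups are measured against the same unigrid run (the level-1 columns of Tables 3 and 6 coincide), so the ratio of execution times follows directly from the speedups 5.8 and 6.6. The storage figures can then be compared with it:

```latex
\frac{T_{\text{multilevel}}}{T_{\text{one way}}}
  = \frac{T_{\text{unigrid}}/6.6}{T_{\text{unigrid}}/5.8}
  = \frac{5.8}{6.6} \approx 0.88,
\qquad
\frac{62\ \text{Mb}}{39\ \text{Mb}} \approx 1.59
```

That is, roughly 12% less time in exchange for roughly 59% more storage.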

Finally, we also performed damped Newton multilevel iterations using GMRES/GS as the linear smoother. In our numerical experiments, we found that the choice of 25 Krylov vectors delivered lower execution times than 20 or 30. We also used cubic and linear interpolation in Algorithms NIC and MGC, respectively (see §4). The numerical results are presented in Table 7. We obtain a speedup of a factor of 10.5 when using four levels, thus significantly improving the maximum speedup obtained with Bi-CGSTAB/GS. Using damped Newton multilevel iterations and GMRES/GS as the linear smoother, the whole numerical solution for the flame sheet problem on a 129 x 161 grid is obtained in about 9 minutes on our workstation. On a supercomputer, the CPU times will drop dramatically.

Table 7: Numerical results for damped Newton multilevel iterations

                             Levels
Operation                    2        3        4
smooth(1)                    218      216      219
smooth(2)                    2272     565      585
smooth(3)                    -        1179     1159
smooth(4)                    -        -        1020
Speedup in time              5.1      9.9      10.5

Smooth(i) represents the total number of GMRES/GS steps done on level i during the steady state Newton iterations. The speedups are with respect to the unigrid solution time in Table 3.


6. CONCLUSIONS 

In this paper, we presented a new numerical procedure to solve flame sheet problems. The 
governing equations use the vorticity-velocity formulation of the Navier-Stokes equations coupled 
together with a conserved scalar equation. By appropriately monitoring the norm of the correction 
vector in the damped Newton iteration, significant savings in the overall execution time can be 
obtained. This performance can be further improved by combining the above numerical procedure with one way nonlinear multigrid and damped Newton multilevel iterations. The latter approach yields lower execution times than the former, but at a higher cost in storage. With four levels of grids, a speedup of 5.8 is obtained with a one way nonlinear multigrid and Bi-CGSTAB/GS as the linear smoother. Similarly, damped Newton multilevel iterations with GMRES/GS as the linear smoother yield a speedup of more than a factor of 10. For three dimensional problems, we expect speedups much greater than 10.
SUMMARY

A key ingredient in the simulation of self-gravitating astrophysical fluid 
dynamical systems is the gravitational potential and its gradient. This paper 
focuses on the development of a mixed method multigrid solver of the Poisson 
equation formulated so that both the potential and the Cartesian components 
of its gradient are self-consistently and accurately generated. The method 
achieves this goal by formulating the problem as a system of four equations 
for the gravitational potential and the three Cartesian components of the 
gradient and solves them using a distributed relaxation technique combined 
with conventional full multigrid V-cycles. The method is described, some
tests are presented, and the accuracy of the method is assessed. We also 
describe how the method has been incorporated into our three-dimensional 
hydrodynamics code and give an example of an application to the collision of 
two stars. We end with some remarks about the future developments of the 
method and some of the applications in which it will be used in astrophysics. 

1. Introduction 

In recent years a number of astrophysicists [1]-[7] have developed simulation
tools which build in increasingly realistic physics. The present work grew 
out of an ongoing effort by us to incorporate enough physics and to realize 
that physics with robust algorithms so that we can simulate both existing 
observed phenomena and make reliable predictions which the astronomers 
can utilize in making better observations and interpreting those 
observations. The ubiquitous existence of fluids and gravitation in the 
universe demands that, if we are to have even the most rudimentary
simulation code, it must incorporate at least interacting fluids and 
gravitational physics. In this work, we restrict our attention to the 
weak-field, Newtonian limit of gravitation. The hydrodynamics code we 
have created also builds in the effects due to the Special Theory of 
Relativity, so the description of high speed phenomena is included. The 
restriction to weak-field gravity implies that the gravitational field is 
determined by the gravitational potential, which must be a solution to 
Poisson’s equation in three dimensions subject to Dirichlet boundary 
conditions at the edges of the computational volume. In the coupled 
hydrodynamic-gravitational system, not only the potential but also its 
gradient is needed. The gradient contributes to the fluid’s acceleration due 
to its self-gravity, inducing the momentum components to change. 

The traditional procedure is to determine the potential by solving the 
Poisson equation with given Dirichlet boundary condition, then construct 
approximations to the components of the gradient via finite differencing the 
potential. However, in simulations of astrophysical gravitating fluids the 
development of quite complex flows must be anticipated. Examples from 
astrophysics include supernova explosions, gravitational collapse,
propagation of high-speed jets from active galactic nuclei, star collisions
and disruptions in dense star clusters, and realistic models of the early 
universe. For most of these simulations, we need to compute the gradients 
of the gravitational potential as accurately as possible, which has motivated 
our development of an alternate approach to the gradient computation. 

Here we describe a method which can yield more robust gradients in 
systems that exhibit large variability in space. This is done using a 
distributed relaxation procedure coupled with full multigrid V-cycles and is 
described in Section 2. In Section 3 we present some tests of the method on 
three-dimensional systems. Section 4 presents our incorporation of it into 
the three-dimensional relativistic hydrodynamics code. Finally, we briefly
describe an application of the code to the collision of two stars and 
comment on the applications for which the code can be used. 

2. The Mixed Method Algorithm 

The problems we are interested in are three-dimensional, and the results 
that we present in later sections are for such problems. However, in 
presenting the method, we will consider its two-dimensional version to 
make the description easier to understand and visualize. All components of
the method, of both the discretization process and the multigrid algorithm, 
have natural three-dimensional analogs. 

a. The Finite Volume Element Discretization. Consider the following partial differential equation defined on some square domain Ω in ℝ²:

$$
\begin{cases}
-\nabla\cdot\nabla\phi = f & \text{in } \Omega,\\
\phi = g & \text{on } \partial\Omega.
\end{cases}
\qquad (1)
$$

We let u and v denote the components of the gradient of −φ:

$$(u, v)^T = -\nabla\phi.$$

Then the partial differential equation may be written in the form of a first-order system in Ω,

$$
\begin{aligned}
u + \phi_x &= 0 \qquad (u \text{ equation})\\
v + \phi_y &= 0 \qquad (v \text{ equation})\\
u_x + v_y &= f \qquad (p \text{ equation}),
\end{aligned}
\qquad (2)
$$

with boundary condition

$$\phi = g \quad \text{on } \partial\Omega.$$

Here the labels u, v, and p for the equations are introduced simply for convenience. To discretize this system, we follow the Finite Volume Element principles developed in [8]. Consider a uniform square mesh Ω^h with mesh size h that covers Ω. We introduce three sets of control volumes, one for each of the three equations in Eq.2. These volumes are shown in Fig. 1. We denote by 𝒰 the set of all volumes U that will be used to discretize the u equation in Eq.2. Similarly, we use the notation 𝒱 and 𝒫 for the sets of volumes V and P for the v and p equations, respectively. For our finite element space we consider the lowest order Raviart-Thomas elements on the partition given by the volumes 𝒫:

u^h is linear in x and constant in y on each P ∈ 𝒫,
v^h is linear in y and constant in x on each P ∈ 𝒫,
φ^h is constant on each P ∈ 𝒫.

The location of the nodes for each of the unknowns, with their indexing, is also shown in Fig. 1. We can now discretize the equations. We take the u equation in Eq.2 and integrate it over each U ∈ 𝒰. As an example, let U_{i,j} be the volume in 𝒰 that is centered at the interior u^h node (i,j). We then have

$$\int_{U_{i,j}} (u + \phi_x)\,dx\,dy = 0,$$

which implies

$$\frac{h^2}{8}\left(u_{i-1,j} + 6u_{i,j} + u_{i+1,j}\right) + h\left(\phi_{i+1,j+1} - \phi_{i,j+1}\right) = 0.$$
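The weights (h²/8)(1, 6, 1) come from integrating a u^h that is linear in x on each cell over a volume reaching from one cell center to the next. A quick numerical check of that stencil, assuming u nodes at x = −h, 0, h and the volume [−h/2, h/2] × [0, h] (function names hypothetical):

```python
import numpy as np

def u_volume_integral(um, u0, up, h, n=1000):
    """Integral of u over the U volume [-h/2, h/2] x [0, h], where u is linear
    in x on each cell (nodal values um, u0, up at x = -h, 0, h) and constant
    in y. Midpoint rule with an even n, exact for piecewise linear u."""
    x = -h / 2 + (np.arange(n) + 0.5) * (h / n)
    u = np.where(x < 0, um + (u0 - um) * (x + h) / h, u0 + (up - u0) * x / h)
    return h * np.sum(u) * (h / n)

def fve_stencil(um, u0, up, h):
    """The discrete mass term h^2/8 (u_{i-1} + 6 u_i + u_{i+1})."""
    return h**2 / 8 * (um + 6 * u0 + up)

print(fve_stencil(1.3, -0.7, 2.1, 0.25))  # approximately -0.00625
```

The same reasoning over a half-size boundary volume [0, h/2] gives the weights (h²/8)(3, 1) used for the boundary u^h nodes.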

Integrating the v equation in Eq.2 over an interior V volume yields a similar discrete expression involving nodal values of v^h and φ^h. Integrating the p equation in Eq.2 over the volume in 𝒫 centered at the interior φ^h node (i,j) (denote this volume by P_{i,j}), we get

$$\int_{P_{i,j}} (u_x + v_y)\,dx\,dy = \int_{P_{i,j}} f\,dx\,dy,$$

which implies

$$h\left(u_{i,j-1} - u_{i-1,j-1}\right) + h\left(v_{i-1,j} - v_{i-1,j-1}\right) = h^2 f_{i,j}.$$

Here, f_{i,j} is the value of f at the φ node, which results from assuming that f is (approximated by) a piecewise constant function on 𝒫. The only remaining part of the discretization involves integrating the u equation in Eq.2 over the “half size” U volumes on the left and right boundaries, and similarly integrating the v equation in Eq.2 over the “half size” V volumes on the lower and upper boundaries. We illustrate this process by integrating the u equation in Eq.2 over the volume U_{1,j} that has the boundary u^h node (1,j) as the midpoint of its left edge. We have

$$\int_{U_{1,j}} (u + \phi_x)\,dx\,dy = 0,$$

which implies

$$\frac{h^2}{8}\left(3u_{1,j} + u_{2,j}\right) + h\left(\phi_{2,j+1} - \phi_{1,j+1}\right) = 0$$

or

$$\frac{h^2}{8}\left(3u_{1,j} + u_{2,j}\right) + h\,\phi_{2,j+1} = h\,\phi_{1,j+1}.$$

Note that φ_{1,j+1} is on the boundary and hence is known. To summarize, the discretization has produced for each U volume a discrete version of the u equation in Eq.2, for each V volume a discrete version of the v equation in Eq.2, and for each P volume a discrete version of the p equation in Eq.2.

b. The Multigrid Algorithm. We assume that the reader is familiar with the fundamentals of multigrid methods; good references are [9], [8], [10]. We consider a family of uniform square grids Ω^h that cover our region Ω, where h denotes the mesh size. Fig. 2 shows a coarse grid Ω^{2h}, with twice the mesh size of the grid Ω^h in Fig. 1. On each grid Ω^h, we can apply the Finite Volume Element discretization process, and we write the discrete set of equations that this process generates as

$$L^h z^h = F^h, \qquad (3)$$

where z^h = (u^h, v^h, φ^h)^T and F^h = (fu^h, fv^h, f^h)^T, and the unknowns u^h, v^h, and φ^h are the nodal values of the corresponding functions on the grid Ω^h. Note that the values of φ at nodes on the boundary are known, so they are not included in φ^h; however, as mentioned in the last section, these boundary values of φ do appear in the equations generated by integration over the U and V volumes near boundaries, resulting in the possibly nonzero terms fu^h and fv^h in Eq.3. We now define the basic components of relaxation, interpolation, and restriction that are necessary to implement a multigrid algorithm.

For the equations on a grid $\Omega^h$, we use a distributive relaxation process similar to that presented in [10]. We can think of relaxation as a three-step process. First, we sweep over all of the $u^h$ nodes, changing the value of $u_{i,j}$ so that the U equation at $(i,j)$ is satisfied. Second, we perform a similar Gauss-Seidel relaxation of all of the V equations. Note that these two steps, the U and V relaxations, are independent of each other and could be performed in parallel. Finally, we sweep over the $\phi^h$ nodes and change the value of $\phi_{i,j}$ together with the two values of $u^h$ and the two values of $v^h$ that lie on the edges of the volume $P_{i,j}$. We change these five unknowns so that the P equation at $(i,j)$ is satisfied and so that the residuals of the U equations at $(i, j-1)$ and $(i-1, j-1)$ and of the V equations at $(i-1, j)$ and $(i-1, j-1)$ are unchanged. To allow vectorization, the Gauss-Seidel relaxation performed in each step is done in a red/black ordering.
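The red/black idea can be sketched in a few lines of NumPy. The sketch does not reproduce the authors' distributive FVE relaxation; as an assumed stand-in it uses the standard 5-point discretization of $-\Delta u = f$ on a uniform grid whose outer ring stores Dirichlet values. Only the colouring and the vectorized half-sweeps correspond to the text; the names `red_black_gauss_seidel` and `residual` are hypothetical.

```python
import numpy as np

def residual(u, f, h):
    """Residual f + Delta_h u of the 5-point discretization of -Delta u = f (interior only)."""
    r = np.zeros_like(u)
    r[1:-1, 1:-1] = f[1:-1, 1:-1] - (
        4.0 * u[1:-1, 1:-1] - u[2:, 1:-1] - u[:-2, 1:-1] - u[1:-1, 2:] - u[1:-1, :-2]
    ) / h**2
    return r

def red_black_gauss_seidel(u, f, h, sweeps=1):
    """In-place red/black Gauss-Seidel sweeps for the 5-point Laplacian.

    A point (i, j) is 'red' if i + j is even and 'black' otherwise. Each red point
    depends only on black neighbours (and vice versa), so all points of one colour
    are updated at once with array operations; this is what makes the sweep vectorizable.
    """
    i, j = np.indices(u.shape)
    interior = np.zeros(u.shape, dtype=bool)
    interior[1:-1, 1:-1] = True
    for _ in range(sweeps):
        for colour in (0, 1):
            mask = interior & ((i + j) % 2 == colour)
            nb = np.zeros_like(u)   # neighbour sums from the current, partly updated array
            nb[1:-1, 1:-1] = u[2:, 1:-1] + u[:-2, 1:-1] + u[1:-1, 2:] + u[1:-1, :-2]
            u[mask] = 0.25 * (nb[mask] + h * h * f[mask])
    return u

# usage: 32 cells per side on [0, 1]^2 with zero boundary values
n = 33
h = 1.0 / (n - 1)
x = np.linspace(0.0, 1.0, n)
X, Y = np.meshgrid(x, x, indexing="ij")
f = 2.0 * np.pi**2 * np.sin(np.pi * X) * np.sin(np.pi * Y)
u = np.zeros((n, n))
r0 = np.linalg.norm(residual(u, f, h))
red_black_gauss_seidel(u, f, h, sweeps=200)
r200 = np.linalg.norm(residual(u, f, h))
```

After a black half-sweep every black equation is satisfied exactly, so the residual lives only on red points; this is the usual signature of a red/black ordering.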

For defining interpolation operators, we use the same principles as outlined in [8]. The Finite Volume Element discretization is based on finite element spaces for the variables $u^h$, $v^h$, and $\phi^h$, so we can simply use the relationship between the finite element spaces on the different grids to define interpolation. To define the interpolation operator for $\phi$, which we denote $I(\phi)^h_{2h}$, we note that $\phi^{2h}$ is constant on the grid $2h$ volume $P_{I,J}$.



Referring to Figs. 1 and 2, we thus have the following characterization of $\phi^h = I(\phi)^h_{2h}\phi^{2h}$:

$$\phi^h_{i,j} = \phi^h_{i+1,j} = \phi^h_{i,j+1} = \phi^h_{i+1,j+1} = \phi^{2h}_{I,J}.$$


To define the interpolation operator for $u$, which we denote $I(u)^h_{2h}$, we note that $u^{2h}$ is linear in $x$ and constant in $y$ on the grid $2h$ volume $P_{I,J}$. We thus have the following characterization of $u^h = I(u)^h_{2h}u^{2h}$ (see Figs. 1 and 2): the fine-grid values of $u^h$ on the left and right edges of $P_{I,J}$ equal the coarse value on the same edge, and the fine-grid values on the vertical line through the center of $P_{I,J}$ are the average of the two coarse values.

The definition of the interpolation operator for $v$ is similar.
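Because $\phi^{2h}$ is constant on each coarse P volume and $u^{2h}$ is linear in $x$ and constant in $y$, both interpolations reduce to simple array operations. The sketch below assumes a hypothetical storage layout that the paper does not specify: `phi_c[I, J]` holds the value on coarse cell $(I, J)$, and `u_c[I, J]` holds the $u$ value at the midpoint of the vertical edge $x = I \cdot 2h$ in coarse cell row $J$.

```python
import numpy as np

def prolong_phi(phi_c):
    """I(phi): copy each coarse P-volume value to the 2 x 2 fine P volumes it contains."""
    return np.kron(phi_c, np.ones((2, 2)))

def prolong_u(u_c):
    """I(u): u is linear in x and constant in y on each coarse P volume.

    u_c has shape (nc + 1, nc): u_c[I, J] sits on the vertical edge x = I * 2h of
    coarse cell row J (hypothetical layout). The result has shape (2 nc + 1, 2 nc).
    """
    nx, ny = u_c.shape
    u_f = np.empty((2 * nx - 1, 2 * ny))
    # fine edges lying on coarse edges take the coarse value (constant in y)
    u_f[0::2, :] = np.repeat(u_c, 2, axis=1)
    # fine edges through the coarse cell centres: average of the two coarse edges
    u_f[1::2, :] = np.repeat(0.5 * (u_c[:-1, :] + u_c[1:, :]), 2, axis=1)
    return u_f

# usage: a coarse field that is linear in x is reproduced exactly on the fine grid
u_c = np.tile(np.arange(4.0)[:, None], (1, 3))     # u = I on the coarse edges, nc = 3
u_f = prolong_u(u_c)                               # expected: u = i / 2 on the fine edges
phi_f = prolong_phi(np.array([[1.0, 2.0], [3.0, 4.0]]))
```

The $v$ interpolation would be the same routine with the roles of $x$ and $y$ exchanged.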


For defining restriction operators, we again use the same principles as outlined in [8]. In the correction scheme multigrid algorithm, which we use here, restriction operators are used to transfer right-hand sides and residuals of equations, not the unknowns themselves. The definitions of the restriction operators are based on the relationship between the volumes on the various grids. The idea is to lump several of the grid $h$ right-hand sides to produce the grid $2h$ right-hand sides. To define the restriction operator for the P equation, which we denote $I(P)^{2h}_h$, we note that a grid $2h$ volume $P_{I,J}$ wholly contains four grid $h$ P volumes. We thus have the following characterization of $f^{2h} = I(P)^{2h}_h f^h$, referring again to Figs. 1 and 2:

$$f^{2h}_{I,J} = f^h_{i,j} + f^h_{i+1,j} + f^h_{i,j+1} + f^h_{i+1,j+1}.$$


To define the restriction operator for the U equation, which we denote $I(U)^{2h}_h$, we note that a grid $2h$ volume $U_{I,J}$ in the interior of $\Omega$ wholly contains two grid $h$ U volumes and half of four others. We thus have the following characterization of $fu^{2h} = I(U)^{2h}_h fu^h$, again referring to Figs. 1 and 2:

$$fu^{2h}_{I,J-1} = fu^h_{i+1,j} + fu^h_{i+1,j-1} + \tfrac{1}{2}\left(fu^h_{i,j} + fu^h_{i,j-1} + fu^h_{i+2,j} + fu^h_{i+2,j-1}\right).$$

The relationship between U volumes at boundaries is different; for example, a grid $2h$ U volume on the left boundary of $\Omega$ wholly contains two of the grid $h$ U volumes and half of two others, yielding (see Figs. 1 and 2)

$$fu^{2h}_{1,J-1} = fu^h_{1,j} + fu^h_{1,j-1} + \tfrac{1}{2}\left(fu^h_{2,j} + fu^h_{2,j-1}\right).$$


The definition of the restriction operator for the V equation is done in a 
similar fashion. 
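The lumping rules for the P and U equations can be sketched with array reshapes. The storage layout is a hypothetical choice (fine $u$ values indexed by vertical-edge number and cell row, shape $(2n_c+1, 2n_c)$), not something the paper prescribes. Since the right-hand sides are volume integrals, a useful check is that restricting the integrals of $f = 1$ reproduces the coarse volume areas.

```python
import numpy as np

def restrict_p(f):
    """I(P): a coarse P volume wholly contains four fine P volumes, so sum them."""
    nx, ny = f.shape[0] // 2, f.shape[1] // 2
    return f.reshape(nx, 2, ny, 2).sum(axis=(1, 3))

def restrict_u(fu):
    """I(U): lump fine U right-hand sides onto coarse U volumes.

    fu has shape (2 nc + 1, 2 nc) (hypothetical layout: vertical-edge index, cell row).
    Interior coarse U volume: two fine U volumes wholly inside plus half of four others.
    Boundary coarse U volume: two wholly inside plus half of two others.
    """
    nxf, nyf = fu.shape
    rows = fu.reshape(nxf, nyf // 2, 2).sum(axis=2)   # the two fine rows in each coarse row
    coarse = rows[0::2, :].copy()                     # fine edges on coarse edges: whole volumes
    coarse[:-1, :] += 0.5 * rows[1::2, :]             # half of the fine volume to the right
    coarse[1:, :] += 0.5 * rows[1::2, :]              # half of the fine volume to the left
    return coarse

# usage with h = 1: integrals of f = 1 over fine U volumes (boundary ones are h/2 wide)
fine_u = np.ones((7, 6))
fine_u[[0, -1], :] = 0.5
coarse_u = restrict_u(fine_u)      # expected coarse areas: 2 (boundary), 4 (interior)
coarse_p = restrict_p(np.ones((6, 6)))
```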

3. Tests of the Mixed Method Algorithm 

A standard approach to Eq. 1 is to solve a discrete equation based on cell-centered finite differences for approximating $\phi$, and then to use simple differencing of this approximation to get the components of its gradient.

We performed some numerical tests to investigate what advantage, in terms of accuracy, the mixed method provides over this standard approach. These tests were for problems with exact solution

$$\phi(x,y,z) = \sin(k_1 x)\sin(k_2 y)\sin(k_3 z), \qquad \Omega = [0,\pi]^3.$$

By varying $k_1$, $k_2$, and $k_3$, we were able to see the effect that oscillations in the solution had on the accuracy of the methods. Below are results for some of these tests on a grid with 32 cells in each direction.


                   MIXED METHOD              STANDARD METHOD
k1   k2   k3     φ_err      (φ_x)_err     φ_err      (φ_x)_err
 1    1    1     7.90E-4    8.15E-4       1.58E-3    8.15E-4
 1   16   16     1.47E-1    1.50E-1       4.59E-1    4.72E-1
16    1    1     1.46E-1    3.61E+0       4.56E-1    3.53E+0
16   16   16     1.47E-1    3.60E+0       4.60E-1    3.60E+0


Here, $\phi_{err}$ and $(\phi_x)_{err}$ are pointwise $l_2$ norms of the error in $\phi$ and its $x$ derivative, scaled by the volume term $h^3$. These results are indicative of results seen for other combinations of $k_1$, $k_2$, and $k_3$. For smooth solutions, the methods give nearly identical results. However, for oscillatory solutions, the mixed method gives more accurate results, particularly for $\phi$.
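The test problem and the error measure can be written down directly. The sketch assumes that "scaled by the volume term $h^3$" means $\sqrt{h^3\sum e^2}$ over the cell centers; the text does not say whether the scaling sits inside or outside the square root. The helper names are hypothetical, and no solver is included: the function only evaluates the exact solution and the norm used in the table.

```python
import numpy as np

def scaled_l2(err, h):
    """Pointwise l2 norm of an error array scaled by the cell volume: sqrt(h^3 * sum(err^2))."""
    return np.sqrt(h**3 * np.sum(np.asarray(err) ** 2))

def exact_solution(x, y, z, k1, k2, k3):
    """phi = sin(k1 x) sin(k2 y) sin(k3 z) and its x derivative."""
    phi = np.sin(k1 * x) * np.sin(k2 * y) * np.sin(k3 * z)
    phi_x = k1 * np.cos(k1 * x) * np.sin(k2 * y) * np.sin(k3 * z)
    return phi, phi_x

def laplacian_phi(phi, k1, k2, k3):
    """Laplacian of the test solution: -(k1^2 + k2^2 + k3^2) * phi."""
    return -(k1**2 + k2**2 + k3**2) * phi

n = 32                                   # cells per direction, as in the table
h = np.pi / n                            # Omega = [0, pi]^3
c = (np.arange(n) + 0.5) * h             # cell-centre coordinates
X, Y, Z = np.meshgrid(c, c, c, indexing="ij")
phi, phi_x = exact_solution(X, Y, Z, 16, 1, 1)
```

An error array from either method would be compared against `phi` and `phi_x` sampled this way before calling `scaled_l2`.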

4. Incorporation of the Mixed Method Solver into 
the Three-Dimensional Hydrodynamics Code 

a. The Physics and the Code. The physics included in the present code consists of a perfect fluid with an adiabatic equation of state formulated in a generally covariant manner. The interval between events in spacetime is represented in the present work in the form

$$ds^2 = -\alpha^2 dt^2 + \gamma_{ij}\left(dx^i + \beta^i dt\right)\left(dx^j + \beta^j dt\right). \qquad (4)$$

The function $\alpha$ is called the lapse and represents the lapse of proper time at a given spatial point. The vector field $\beta^i$ is called the shift vector and determines how much the spatial coordinates shift from one $t$ = constant slice to the next infinitesimally later one. The second-rank symmetric tensor field $\gamma_{ij}$ is the metric tensor of the spatial geometry. In the general theory of relativity [12], the four-dimensional geometry of spacetime is dynamic, and the lapse, shift, and three-metric are related to the kinematical description of the coordinates of the observer and the spatial geometry. The fluid energy-momentum tensor must obey a local conservation law in order to be consistent with Einstein's theory. When supplemented with the conservation of baryons, the conservation laws can be written in the following form:

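Rest-mass conservation

$$\frac{1}{\alpha\gamma^{1/2}}\frac{\partial}{\partial t}\left(\gamma^{1/2} d\right) + \frac{1}{\alpha\gamma^{1/2}}\frac{\partial}{\partial x^i}\left(\gamma^{1/2} d\,V^i\right) = 0 \qquad (5)$$

Internal energy equation

$$\frac{1}{\alpha\gamma^{1/2}}\frac{\partial}{\partial t}\left(\gamma^{1/2} e\right) + \frac{1}{\alpha\gamma^{1/2}}\frac{\partial}{\partial x^i}\left(\gamma^{1/2} e\,V^i\right) = -P\left[\frac{1}{\alpha\gamma^{1/2}}\frac{\partial}{\partial t}\left(\gamma^{1/2} W\right) + \frac{1}{\alpha\gamma^{1/2}}\frac{\partial}{\partial x^i}\left(\gamma^{1/2} W V^i\right)\right] \qquad (6)$$

Momentum equation

$$\frac{1}{\alpha\gamma^{1/2}}\frac{\partial}{\partial t}\left(\gamma^{1/2} S_j\right) + \frac{1}{\alpha\gamma^{1/2}}\frac{\partial}{\partial x^i}\left(\gamma^{1/2} S_j V^i\right) = -\frac{\partial P}{\partial x^j} - \frac{(d + e + PW)W}{\alpha}\frac{\partial \alpha}{\partial x^j} + \frac{S_k S_l}{2W(d + e + PW)}\frac{\partial \gamma^{kl}}{\partial x^j} \qquad (7)$$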

The variables $d$, $e$, and $S_i$, which are used in the code, are defined as follows: $d = \rho W$, $e = \rho\epsilon W$, and $S_i = (\rho + \rho\epsilon + P)u_i$. Here $d$, $e$, and $S_i$ are, respectively, the coordinate mass density, the internal energy density, and the covariant components of the relativistic momentum density. Eqs. (5)-(7) are the equations of general relativistic fluid dynamics in a general background spacetime. Since the present paper is restricted to the study of phenomena with weak gravitational fields, we introduce the following Newtonian approximations to the lapse, shift, and three-metric in Cartesian coordinates:


$$\alpha \approx 1 + \phi, \qquad (8)$$

$$\beta^i = 0, \qquad (9)$$

$$\gamma_{ij} = \delta_{ij}. \qquad (10)$$


The scalar field $\phi$ is the Newtonian gravitational potential and must satisfy the Poisson equation

$$\nabla^2\phi = 4\pi G\rho \qquad (11)$$

in the computational volume, with Dirichlet boundary conditions on the volume edges.

With the Newtonian approximation to the geometric variables, it then follows that the self-gravity of the fluid contributes to the change in the momentum density through the term $\rho\nabla\alpha$. The value of $\alpha$ itself enters several places in the fluid equations. Thus, a complete characterization of the self-gravitating fluid dynamics requires both the lapse and its gradient vector. It is these quantities that our mixed method computes in a robust manner. Concerning the elements that constitute the hydrodynamics part of the code, the methods used may be characterized as explicit finite volume schemes. The physical variables $d$, $e$, and $S_i$ are the fundamental quantities. These variables are discretized on a staggered grid system, with the conventions that scalar variables such as density are stored at zone centers, while vector variables are centered on the faces of the zones. By far the biggest challenge is to treat the advection of the physical variables as accurately as possible. This is especially true for astrophysical applications, since complex flows abound. We want the code to be able to detect and track shocks adequately. The advection method implemented in the code is based on a monotonic advection algorithm due originally to Van Leer [11]. It is robust and tracks shocks reasonably well. The code uses artificial viscosity to smooth developing discontinuities over a few zones.

For this we use an artificial viscosity pressure, which is a combination of linear and quadratic functions of the monotonized four-velocity differences. The code uses an adiabatic equation of state of the form $P = (\Gamma - 1)\rho\epsilon$, where $\Gamma$ is the parameter that characterizes the equation of state and can itself be a function of the thermodynamic variables and position. For the model stars we discuss here, $\Gamma$ is chosen to be a constant. The overall structure of a single computational step of the code is described in [7].

At the end of the computational step, the fully updated physical variables are available. The Poisson_Solver routine is then invoked, and it is here that we utilize our mixed method solver, which returns $\phi$ and $\nabla\phi$.

b. Application to Collision of Stars. As a nontrivial application of the code, we present a summary of the results of using the mixed method Poisson solver in the simulation of the collision of two stars that are initially in equilibrium. The initial data were chosen so that the mass density and energy density correspond to two equilibrium spherical stars. We have chosen the $n = 1$ polytropic equation of state. This equation of state has the following functional forms for the initial mass density and energy density: $d = d_0 \sin\xi/\xi$ and $e = e_0 (\sin\xi/\xi)^2$, where $\xi = \pi r/r_0$ and $r_0$ is the equilibrium radius of the star. The two model stars were placed with their centers displaced in the $z = 0$ plane. We show here the results of simulations in which the radii were chosen equal to $0.26 R_\odot$ and the central mass density $d_0$ equal to 6.6 g/cm$^3$. The central temperature of each star was chosen to be $4.0 \times 10^6$ K. The simulations shown here were all done with a $66^3$ grid. All computations were performed on the Ohio Supercomputer Center's Cray YMP8/864. The hydrodynamics part of the code has been highly vectorized.
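The $n = 1$ polytrope profile is the Lane-Emden solution $\theta = \sin\xi/\xi$ with $\xi = \pi r/r_0$, which NumPy evaluates as `np.sinc(r / r0)` because `np.sinc(t)` is $\sin(\pi t)/(\pi t)$. The sketch assumes $e \propto \rho^2$ (for $n = 1$, $P = K\rho^2$, and $e \approx \rho\epsilon = P/(\Gamma - 1)$) and sets the density to zero outside the star. The value `e0 = 1.0` and the solar radius constant are illustrative assumptions; the paper fixes the central temperature instead of $e_0$.

```python
import numpy as np

def n1_polytrope(r, r0, d0, e0):
    """Initial d and e for an n = 1 polytrope of radius r0.

    theta = sin(xi)/xi with xi = pi * r / r0, computed as np.sinc(r / r0).
    Assumes e is proportional to P, i.e. to rho**2 for n = 1, hence e = e0 * theta**2.
    """
    r = np.asarray(r, dtype=float)
    theta = np.where(r < r0, np.sinc(r / r0), 0.0)   # no matter outside the star
    return d0 * theta, e0 * theta**2

R_SUN_CM = 6.96e10                      # solar radius in cm (approximate, assumed)
r0 = 0.26 * R_SUN_CM                    # stellar radius used in the runs
d, e = n1_polytrope(np.array([0.0, 0.5 * r0, r0, 2.0 * r0]), r0, d0=6.6, e0=1.0)
```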










Fig. 3a shows the contours for the initial potential and its gradient components in the $z = 0$ plane for a run of an off-center collision. The stars were chosen initially to have a relative velocity comparable to the orbital velocity. Fig. 3b is a plot of the density contours and velocity field in the $z = 0$ plane. Subsequent motion is induced by the combined effects of the initial momentum and the self-gravity of the two stars. Because the stars attract each other, they develop accelerations toward each other, and the resulting hydrodynamics alters the density and energy distributions. Typical simulations were run for times at least of the order of the gravitational free-fall time. Given the combined interactions of the hydrodynamics with self-gravity, we expect disruption of the two stars if the collision is sufficiently violent. Figs. 4a and 4b show late-time snapshots of the off-center collision: the potential contours and gradient, and the density contours and velocities, respectively.

We conclude from these simulations and others that the mixed method 
Poisson solver produces physically acceptable results when combined with 
the three-dimensional hydrodynamics. This code is currently being used to 
simulate higher resolution runs and other multiple-star systems. We will be 
using the present code to treat the collision of two neutron stars and 
compute its final state and the amount of gravitational radiation emitted 
by such systems. Such computations are of importance because they can shed light on the astrophysics of the mergers of neutron stars as well as
provide potentially important benchmarks of how much gravitational 
radiation should be expected from such encounters. 
Figure 3a. Initial Gravitational Potential and Gradient
Figure 3b. Initial Density and Velocity
MULTIGRID METHODS FOR DIFFERENTIAL EQUATIONS WITH HIGHLY OSCILLATORY COEFFICIENTS*


Bjorn Engquist    Erding Luo
Dept. of Math., UCLA, CA 90024


SUMMARY 

New coarse grid multigrid operators for problems with highly oscillatory coefficients are developed. These types of operators are necessary when the character of the differential equations on coarser grids or longer wavelengths is different from that on the fine grid. Elliptic problems for composite materials and different classes of hyperbolic problems are practical examples.

The new coarse grid operators can be constructed directly from the homogenized differential operators or computed hierarchically from the finest grid. Convergence analysis based on homogenization theory is given for elliptic problems with periodic coefficients and for some hyperbolic problems. These are classes of equations for which there exists a fairly complete theory for the interaction between shorter and longer wavelengths in the problems. Numerical examples are presented.

INTRODUCTION 

Multigrid methods are usually not very effective when applied to problems for which the standard coarse grid operators have significantly different properties from those of the fine grid operators [1,3,7-9,11-12]. In some of these problems the coarse grid operators should be constructed based on principles other than simple restriction from the finest grid. Elliptic and parabolic equations with strongly variable coefficients and some hyperbolic equations are such problems. One feature of these problems is that the smallest eigenvalues do not correspond to very smooth eigenfunctions. It is thus not easy to represent these eigenfunctions on the coarser grids.

*This work was partially supported by grants from NSF: DMS 91-03104, DARPA: ONR N00014-92-J-1890, and ARO: ARM DAAL03-91-G0162.

We shall investigate elliptic equations with highly oscillatory coefficients,

$$-\sum_{i,j}\frac{\partial}{\partial x_i}\left(a^\epsilon_{ij}(x)\frac{\partial}{\partial x_j}u^\epsilon(x)\right) = f(x), \qquad a^\epsilon_{ij}(x) = a_{ij}\!\left(x, \frac{x}{\epsilon}\right), \qquad (1)$$

with $a_{ij}(x,y)$ strictly positive, continuous, and 1-periodic in $y$. This is one class of the problems discussed above for which there exists a fairly complete analytic theory such that a rigorous treatment is possible. This homogenization theory describes the dependence of the large-scale features in the solutions on the smaller scales in the coefficients [2,11]. We shall consider model problems, but there are also important practical applications of these equations in the study of elasticity and heat conduction for composite materials.

In this paper we analyse the convergence of multigrid methods for equation (1) by introducing new coarse grid operators based on local or global homogenized forms of the equation. We consider only two-level multigrid methods. For full multigrid, or with more general coefficients, the homogenized operator can be numerically calculated from the finer grids based on local solution of the so-called cell problem [2].

In a number of numerical tests we compare the convergence rate for different choices of parameters and coarse grid operators applied to a two-dimensional elliptic model problem.

The convergence rate is also analyzed theoretically for a one-dimensional problem. If, for example, the oscillatory coefficient is replaced by its average, the direct estimate for the multigrid convergence rate is not asymptotically better than that obtained with the damped Jacobi smoothing operator alone. The homogenized coefficient reduces the number of smoothing operations from $O(h^{-2})$ to $O(h^{-10/7}\log h)$. When $h/\epsilon$ belongs to the set of Diophantine numbers [4], ergodic mixing improves the estimate to $O(h^{-6/5}\log h)$. Here $h$ is the step size and $\epsilon$ the wavelength of the oscillating coefficient.

These results carry over to some but not all hyperbolic problems. A numerical study 
of using hyperbolic time stepping with multigrid in order to compute a steady state gives 
similar results to the elliptic case. 


TWO DIMENSIONAL ELLIPTIC PROBLEMS 


Elliptic problems of the form (1) will be considered,

$$-\nabla\cdot\left(a^\epsilon(x,y)\nabla u^\epsilon\right) = f(x,y), \qquad (x,y)\in\Omega = [0,1]\times[0,1], \qquad (2)$$

subject to the Dirichlet boundary condition $u^\epsilon|_{\partial\Omega} = 0$. The function $a^\epsilon(x,y) = a(x/\epsilon, y/\epsilon)$, where $a$ is strictly positive and 1-periodic in each argument. It follows from homogenization theory [2] that

$$\max|u^\epsilon - u| \to 0 \quad \text{as } \epsilon \to 0,$$

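where $u$ satisfies the following effective equation,

$$-A_{11}\frac{\partial^2 u}{\partial x^2} - (A_{12} + A_{21})\frac{\partial^2 u}{\partial x\,\partial y} - A_{22}\frac{\partial^2 u}{\partial y^2} = f(x,y), \qquad (3)$$

subject to the same boundary condition. Here,

$$A_{ij} = \int_0^1\!\!\int_0^1 a(s_1,s_2)\left(\delta_{ij} - \frac{\partial\chi_j}{\partial s_i}\right)ds_1\,ds_2, \qquad i,j = 1,2,$$

and the 1-periodic functions $\chi_j$ are given by

$$\sum_{i=1}^{2}\frac{\partial}{\partial s_i}\left(a(s_1,s_2)\frac{\partial\chi_j}{\partial s_i}\right) = \frac{\partial a(s_1,s_2)}{\partial s_j}, \qquad j = 1,2.$$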


For the numerical examples we shall choose a special case with a diagonal oscillatory coefficient,

$$a^\epsilon(x,y) = a\!\left(\frac{x - y}{\epsilon}\right). \qquad (4)$$

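From (3), we know that the corresponding homogenized equation is

$$-\frac{\mu + \alpha}{2}\frac{\partial^2 u}{\partial x^2} - (\alpha - \mu)\frac{\partial^2 u}{\partial x\,\partial y} - \frac{\mu + \alpha}{2}\frac{\partial^2 u}{\partial y^2} = f(x,y), \qquad (5)$$

where $\mu = m(1/a^\epsilon)^{-1}$ and $\alpha = m(a^\epsilon)$. Here, the mean value $m(f)$ of an $\epsilon$-periodic function $f$ is defined as

$$m(f) = \frac{1}{\epsilon}\int_0^\epsilon f(x)\,dx.$$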
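The two means that enter the homogenized operator are easy to compute for the coefficient used in the experiments, $a(s) = 2.1 + 2\sin(2\pi s)$. For a coefficient depending only on $(x - y)/\epsilon$, the standard laminate result gives a homogenized tensor with diagonal entries $(\mu + \alpha)/2$ and off-diagonal entries $(\alpha - \mu)/2$; the sketch computes those numbers under that assumption. For $a = A + B\sin(2\pi s)$ with $A > |B|$ one also has the closed forms $\alpha = A$ and $\mu = \sqrt{A^2 - B^2}$, which serve as a check. The helper name `cell_means` is hypothetical.

```python
import numpy as np

def cell_means(a, samples=4096):
    """Harmonic mean mu = m(1/a)^(-1) and arithmetic mean alpha = m(a) of a 1-periodic
    coefficient a(s); the uniform-sample average is the periodic trapezoidal rule."""
    s = np.arange(samples) / samples
    vals = a(s)
    return 1.0 / np.mean(1.0 / vals), np.mean(vals)

a = lambda s: 2.1 + 2.0 * np.sin(2.0 * np.pi * s)   # coefficient used in the experiments
mu, alpha = cell_means(a)
diag_coef = 0.5 * (mu + alpha)   # A11 = A22: coefficient of -u_xx and -u_yy
cross_coef = alpha - mu          # A12 + A21: coefficient of -u_xy
```

Because the periodic trapezoidal rule converges exponentially for smooth periodic integrands, a few thousand samples already give the means to machine precision here.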

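For convenience, we introduce a brief notation for an $N\times N$ block tridiagonal matrix $T$,

$$T = \mathrm{Tridiag}[T_{i,i-1}, T_{i,i}, T_{i,i+1}]_{N\times N} = \begin{pmatrix} T_{11} & T_{12} & & \\ T_{21} & T_{22} & T_{23} & \\ & \ddots & \ddots & \ddots \\ & & T_{N,N-1} & T_{NN} \end{pmatrix}.$$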
Numerical Algorithm

The discretization of (2) combined with (4) is

$$-D^x_+\left(a_{ij}D^x_- u_{ij}\right) - D^y_+\left(b_{ij}D^y_- u_{ij}\right) = f_{ij}, \qquad (6)$$

where $a_{ij} = a^\epsilon(x_i - \tfrac{h}{2} - y_j)$, $b_{ij} = a^\epsilon(x_i - y_j + \tfrac{h}{2})$, $i,j = 0,\dots,N$. $D_+$ and $D_-$ are forward and backward divided differences, respectively; $h = 1/N$ denotes the grid size. Using vector notation, we can rewrite (6) as

$$L_{\epsilon,h}U_{\epsilon,h} = F_{\epsilon,h}, \qquad (7)$$

where

$$L_{\epsilon,h} = \frac{1}{h^2}\mathrm{Tridiag}[B^h_j, A^h_j, B^h_{j+1}]_{(N-1)\times(N-1)},$$

$$A^h_j = \mathrm{Tridiag}[-a_{i,j},\; a_{i,j} + a_{i+1,j} + b_{i,j} + b_{i,j+1},\; -a_{i+1,j}]_{(N-1)\times(N-1)},$$

$B^h_j$ is a diagonal matrix, denoted by $B^h_j = \mathrm{Diag}[-b_{i,j}]_{(N-1)\times(N-1)}$, and

$$U_{\epsilon,h} = (u_{11}, u_{21}, \dots, u_{N-1,1}, \dots, u_{1,N-1}, u_{2,N-1}, \dots, u_{N-1,N-1})^T,$$

$$F_{\epsilon,h} = (f_{11}, f_{21}, \dots, f_{N-1,1}, \dots, f_{1,N-1}, f_{2,N-1}, \dots, f_{N-1,N-1})^T.$$

For simplicity, we only consider the two-grid method. Denote the full iteration operator of this method by $M$. It is defined by

$$M = S^\gamma\left(I - I^h_H L_H^{-1} I^H_h L_{\epsilon,h}\right)S^\gamma, \qquad (8)$$


where the restriction and interpolation operators are given by the weighted restriction and the bilinear interpolation operators, respectively, denoted in stencil form by

$$I^H_h = \frac{1}{16}\begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix}^H_h, \qquad I^h_H = \frac{1}{4}\left]\begin{matrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{matrix}\right[^h_H.$$
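Both stencils are tensor products of the 1D operators, and full weighting equals $\tfrac14$ times the transpose of bilinear interpolation. The sketch builds them as dense matrices via Kronecker products for a small grid; the helper names are hypothetical, and dense matrices are chosen only for readability.

```python
import numpy as np

def interpolation_1d(n_fine):
    """Linear interpolation from n_fine/2 - 1 coarse to n_fine - 1 fine interior nodes."""
    nc = n_fine // 2
    P = np.zeros((n_fine - 1, nc - 1))
    for k in range(1, nc):
        i = 2 * k - 1                      # row of the fine node coinciding with coarse node k
        P[i - 1:i + 2, k - 1] = [0.5, 1.0, 0.5]
    return P

def transfer_2d(n_fine):
    """Bilinear interpolation and full-weighting restriction in 2D via Kronecker products."""
    P1 = interpolation_1d(n_fine)
    P2 = np.kron(P1, P1)                   # bilinear interpolation, stencil (1/4)][1 2 1; 2 4 2; 1 2 1][
    R2 = 0.25 * P2.T                       # full weighting, stencil (1/16)[1 2 1; 2 4 2; 1 2 1]
    return P2, R2

P2, R2 = transfer_2d(8)                    # 7 x 7 fine interior nodes, 3 x 3 coarse
```

In the lexicographic ordering used here, coarse node $(2,2)$ is column 4 and the coinciding fine node $(4,4)$ is row 24, so `R2[4, 24]` is the stencil's center weight $4/16$.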


The smoothing iteration operator $S$ is based on the damped Jacobi iteration,

$$S = I - \omega h^2 L_{\epsilon,h}. \qquad (9)$$

The coarse grid operator $L_H$ is one of the following operators:
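Global Homogenized operator: the discretized form of (5),

$$-\tfrac{1}{2}(\mu + \alpha)D^x_+D^x_- u_{ij} + (\mu - \alpha)D^x_0 D^y_0 u_{ij} - \tfrac{1}{2}(\mu + \alpha)D^y_+D^y_- u_{ij} = f_{ij}.$$

Written in matrix form,

$$L_H = \frac{1}{H^2}\mathrm{Tridiag}[B^H_-, A^H, B^H_+]_{(\frac{N}{2}-1)\times(\frac{N}{2}-1)}, \qquad (10)$$

where

$$A^H = \mathrm{Tridiag}\left[-\frac{\mu + \alpha}{2},\; 2(\mu + \alpha),\; -\frac{\mu + \alpha}{2}\right]_{(\frac{N}{2}-1)\times(\frac{N}{2}-1)},$$

$$B^H_\pm = \mathrm{Tridiag}\left[\pm\frac{\alpha - \mu}{4},\; -\frac{\mu + \alpha}{2},\; \mp\frac{\alpha - \mu}{4}\right]_{(\frac{N}{2}-1)\times(\frac{N}{2}-1)}.$$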

Local Homogenized operator: $L_H$ has the same form as (10), except that the entries of $A^H$ and $B^H$ come from the local discretized values of $a^\epsilon(x - y)$,

$$L_H = \frac{1}{H^2}\mathrm{Tridiag}[B^H_{j-1}, A^H_j, B^H_j]_{(\frac{N}{2}-1)\times(\frac{N}{2}-1)}, \qquad (11)$$

where $A^H_j$ and $B^H_j$ are tridiagonal blocks whose entries are built from local coarse-grid coefficients $a^H$, $b^H$, and $c^H$; these are computed from the fine-grid values $a^h$ and $b^h$ with the help of a two-argument function $\Phi(c_1, c_2)$.

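Reduced Local Homogenized operator: $L_H$ has the same form as in (11), except that here we ignore the cross term $D^x_0 D^y_0$. That means $B^H_j$ is a diagonal matrix, $B^H_j = \mathrm{Diag}[-b^H_{i,j}]_{(\frac{N}{2}-1)\times(\frac{N}{2}-1)}$, and

$$L_H = \frac{1}{H^2}\mathrm{Tridiag}[B^H_{j-1}, A^H_j, B^H_j]_{(\frac{N}{2}-1)\times(\frac{N}{2}-1)}. \qquad (12)$$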


Sampling operator: $L_H$ has exactly the form of $L_{\epsilon,h}$, but the values $a_{ij}$, $b_{ij}$ are defined on the coarse grid,

$$L_H = L_{\epsilon,H}. \qquad (13)$$

Variational operator:

$$L_H = I^H_h L_{\epsilon,h} I^h_H. \qquad (14)$$


Numerical Results 


In practice, it is not always easy to calculate the spectral radius $\rho(M)$. Therefore, we study the mean rate of convergence [14] under different coarse grid operators $L_H$. The mean rate of convergence is defined by

$$\bar\rho = \left(\frac{\|L_{\epsilon,h}u^i - f\|_h}{\|L_{\epsilon,h}u^0 - f\|_h}\right)^{1/i}, \qquad (15)$$

where $i$ is the smallest integer satisfying $\|L_{\epsilon,h}u^i - f\|_h < 1 \times 10^{-6}$.
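A common definition of the mean rate of convergence, and the one assumed in this sketch, is the geometric mean of the per-iteration residual reductions, $(\|r_i\|/\|r_0\|)^{1/i}$, where $i$ is the first iteration whose residual norm falls below $10^{-6}$. The iteration below is a hypothetical stand-in with a known contraction factor of 0.3, used only to show that the measured mean rate recovers it; in practice `step` would be one two-grid cycle.

```python
import numpy as np

def run_until(step, u0, residual_norm, tol=1e-6, max_iter=10_000):
    """Apply `step` until the residual norm drops below tol; return (u, history)."""
    u = u0
    history = [residual_norm(u)]
    while history[-1] >= tol and len(history) <= max_iter:
        u = step(u)
        history.append(residual_norm(u))
    return u, history

def mean_rate(history):
    """Geometric-mean reduction factor (||r_i|| / ||r_0||) ** (1 / i)."""
    i = len(history) - 1
    return (history[-1] / history[0]) ** (1.0 / i)

# usage with a hypothetical linear iteration whose error contracts by 0.3 per step
A = np.diag([2.0, 5.0])
b = np.array([1.0, 1.0])
u_exact = np.linalg.solve(A, b)
step = lambda u: u_exact + 0.3 * (u - u_exact)     # stand-in for one multigrid cycle
res = lambda u: np.linalg.norm(A @ u - b)
_, hist = run_until(step, np.zeros(2), res)
rho_bar = mean_rate(hist)
```

Measuring the rate this way avoids forming $M$ explicitly, which is the practical motivation given in the text.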

In Figure 1, $a^\epsilon(x - y) = 2.1 + 2\sin(2\pi(x - y)/\epsilon)$. We plot the mean rate $\bar\rho$ defined by (15) as a function of $\gamma$, taking $\epsilon = \sqrt{2}h$ and $\omega = 0.095$ in (9).

(1.1) N = 16    (1.2) N = 32

Figure 1: $\bar\rho$ as a function of $\gamma$. Dotted line is for (10), solid line for (11), dashed line for (13), dash-dot line for (12), and + for (14). (1.1)-(1.4) are for different numbers of grid points N.
It is clear that the coarse grid operators derived from the homogenized forms (10) and (11) are superior. The effect is more pronounced for large $\gamma$, when the eigenspace corresponding to large eigenvalues of $L_{\epsilon,h}$ is essentially eliminated. For the practical low-$\gamma$ case, a study of the impact of the choices of $I^H_h$ and $I^h_H$ is needed. In this paper we concentrate on the asymptotic behavior (large $\gamma$). Different $I^H_h$ and $I^h_H$ operators are briefly discussed for the one-dimensional problem.

In Figure 2, we plot $\bar\rho$ as a function of the variable $a$, where $L_H(a)$ comes from the discretized operator $-a^H D^x_+D^x_- + a(\mu - \alpha)D^x_0 D^y_0 - b^H D^y_+D^y_-$, with $\omega = 0.095$ and $\epsilon = \sqrt{2}h$.

Figure 2: $\bar\rho$ as a function of $a$. Here N = 64 and $\gamma = 12$. The two marker types denote $\bar\rho$ for the Normal and the Local Homogenized coarse grid operators, respectively.

From Figure 2, we get further evidence of the importance of using the correct homogenized operator. Techniques based on one-dimensional analysis do not contain the mixed derivative term [1].

In order to isolate the influence of the coarse grid approximation, we have kept the smoothing operator fixed. It obviously also affects the performance. If we use Gauss-Seidel iteration instead of the damped Jacobi iteration in (9), the convergence rate can be improved. In Table 1, we test the same coefficient $a^\epsilon(x - y) = 2.1 + 2\sin(2\pi(x - y)/\epsilon)$. Taking N = 128 and $\epsilon = \sqrt{2}h$, we compare the convergence rates obtained with damped Jacobi iteration and Gauss-Seidel iteration.


γ         5       6       7       8       9       10      11      12
Jacobi    .5929   .5519   .5173   .4863   .4579   .4349   .4140   .3950
G-S       .4545   .4221   .3922   .3703   .3491   .3304   .3158   .3008

Table 1: Spectral radius, two dimensional case
ONE DIMENSIONAL PROBLEMS


The one-dimensional equation is useful as a model for which a more complete mathematical analysis is possible.

Consider the following one-dimensional elliptic boundary value problem with a periodic oscillatory coefficient,

$$-\frac{d}{dx}\left(a^\epsilon(x)\frac{du^\epsilon}{dx}\right) = 1, \qquad 0 < x < 1, \qquad u^\epsilon(0) = u^\epsilon(1) = 0, \qquad (16)$$

where $a^\epsilon(x) = a(x/\epsilon)$ satisfies the same assumptions as above. As $\epsilon \to 0$, $u^\epsilon$ converges strongly in $L^2$ to the solution $u$ of the homogenized equation

$$-\bar a\frac{d^2 u}{dx^2} = 1, \qquad 0 < x < 1, \qquad \bar a = m(1/a^\epsilon)^{-1}, \qquad (17)$$

subject to the boundary conditions $u(0) = u(1) = 0$.

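Numerical Algorithm

Let the difference approximation of (16) be of the form

$$\frac{1}{h^2}\left[-a^\epsilon(x_{j+\frac12})\left(u^h_{j+1} - u^h_j\right) + a^\epsilon(x_{j-\frac12})\left(u^h_j - u^h_{j-1}\right)\right] = 1, \qquad j = 1,\dots,N-1. \qquad (18)$$

In matrix form, (18) can be written as

$$L_{\epsilon,h}U^h = \mathbf{1}, \qquad U^h = (u^h_1,\dots,u^h_{N-1})^T,$$

where

$$L_{\epsilon,h} = \frac{1}{h^2}\mathrm{Tridiag}[-a_j,\; a_j + a_{j+1},\; -a_{j+1}]_{(N-1)\times(N-1)} \qquad (19)$$

with $a_j = a^\epsilon(x_j - \tfrac{h}{2})$.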
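The matrix (19) can be assembled directly, and solving it illustrates the convergence $u^\epsilon \to u$ toward the homogenized solution $u = x(1 - x)/(2\bar a)$ of (17). The sketch assumes $a(s) = 2.1 + 2\sin(2\pi s)$ (the coefficient used in the figures) and the illustrative parameters $\epsilon = 0.02$, $N = 1000$, so each period is resolved by 20 grid cells; the function names are hypothetical. A dense solve is used only for brevity, since the system is tridiagonal.

```python
import numpy as np

def fine_matrix(N, a_eps):
    """L_{eps,h} of (19): (1/h^2) Tridiag[-a_j, a_j + a_{j+1}, -a_{j+1}], a_j = a_eps(x_j - h/2)."""
    h = 1.0 / N
    a = a_eps((np.arange(1, N + 1) - 0.5) * h)          # a_1 .. a_N
    return (np.diag(a[:-1] + a[1:]) - np.diag(a[1:-1], 1) - np.diag(a[1:-1], -1)) / h**2

def compare_with_homogenized(N=1000, eps=0.02):
    """Solve (18) and return (max |u_eps - u_hom|, max u_hom)."""
    a = lambda s: 2.1 + 2.0 * np.sin(2.0 * np.pi * s)   # 1-periodic cell coefficient (assumed)
    L = fine_matrix(N, lambda x: a(x / eps))
    u_eps = np.linalg.solve(L, np.ones(N - 1))
    s = np.arange(4096) / 4096
    a_bar = 1.0 / np.mean(1.0 / a(s))                   # m(1/a)^(-1), the harmonic mean
    x = np.arange(1, N) / N
    u_hom = x * (1.0 - x) / (2.0 * a_bar)               # solves -a_bar u'' = 1, u(0) = u(1) = 0
    return np.abs(u_eps - u_hom).max(), u_hom.max()

err, peak = compare_with_homogenized()
```

The difference is of order $\epsilon$, consistent with the homogenization result quoted above; replacing `a_bar` by the arithmetic mean 2.1 would instead give a solution more than three times too small.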

We first consider a two-grid method using standard restriction and interpolation operators and Jacobi smoothing iteration.

The coarse grid operator will be one of the following:

Homogenized operator:

$$L_H = \frac{\bar a}{H^2}\mathrm{Tridiag}[-1, 2, -1]_{(\frac{N}{2}-1)\times(\frac{N}{2}-1)} \qquad (20)$$

Averaged operator:

$$L_H = \frac{m(a^\epsilon)}{H^2}\mathrm{Tridiag}[-1, 2, -1]_{(\frac{N}{2}-1)\times(\frac{N}{2}-1)} \qquad (21)$$
Sampling operator: $L_H$ has exactly the form of $L_{\epsilon,h}$, but only every second $a_j$ value is used,

$$L_H = L_{\epsilon,H}. \qquad (22)$$

Variational operator:

$$L_H = I^H_h L_{\epsilon,h} I^h_H. \qquad (23)$$
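The two-grid operator (8) is small enough in 1D to form explicitly and compare coarse grid operators by spectral radius. The sketch makes several assumptions not taken from the paper: the coefficients of (20) and (21) are computed from the sampled $a_j$ (harmonic and arithmetic means), $\omega = 1/\lambda_{\max}(h^2 L_{\epsilon,h})$ replaces the tuned value 0.1829 so that the Jacobi iteration matrix has eigenvalues in $[0, 1)$, and the parameters $N = 64$, $\gamma = 20$ are illustrative. The sampling operator (22) is omitted for brevity.

```python
import numpy as np

def fine_operator(N, a_eps):
    """L_{eps,h} = (1/h^2) Tridiag[-a_j, a_j + a_{j+1}, -a_{j+1}], a_j = a_eps(x_j - h/2)."""
    h = 1.0 / N
    a = a_eps((np.arange(1, N + 1) - 0.5) * h)           # a_1 .. a_N
    L = (np.diag(a[:-1] + a[1:]) - np.diag(a[1:-1], 1) - np.diag(a[1:-1], -1)) / h**2
    return L, a

def linear_interpolation(N):
    """Standard linear interpolation from N/2 - 1 coarse to N - 1 fine interior nodes."""
    nc = N // 2
    P = np.zeros((N - 1, nc - 1))
    for k in range(1, nc):
        P[2 * k - 2:2 * k + 1, k - 1] = [0.5, 1.0, 0.5]
    return P

def coarse_laplacian(N, coef):
    """coef / H^2 * Tridiag[-1, 2, -1] on the N/2 - 1 coarse interior nodes, H = 2h."""
    n, H = N // 2 - 1, 2.0 / N
    return coef * (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / H**2

def two_grid_radius(L, L_H, P, h, gamma):
    """Spectral radius of M = S^gamma (I - P L_H^{-1} R L) S^gamma with R = P^T / 2."""
    n = L.shape[0]
    R = 0.5 * P.T                                        # full weighting (1/4, 1/2, 1/4)
    omega = 1.0 / np.linalg.eigvalsh(h**2 * L).max()     # assumed choice: eig(S) in [0, 1)
    Sg = np.linalg.matrix_power(np.eye(n) - omega * h**2 * L, gamma)
    E = np.eye(n) - P @ np.linalg.solve(L_H, R @ L)      # coarse grid correction
    return np.abs(np.linalg.eigvals(Sg @ E @ Sg)).max()

def compare_coarse_operators(N=64, gamma=20):
    h = 1.0 / N
    eps = np.sqrt(2.0) * h                               # epsilon = sqrt(2) h, as in the text
    L, a = fine_operator(N, lambda x: 2.1 + 2.0 * np.sin(2.0 * np.pi * x / eps))
    P = linear_interpolation(N)
    candidates = {
        "homogenized": coarse_laplacian(N, 1.0 / np.mean(1.0 / a)),  # (20), harmonic mean of samples
        "averaged": coarse_laplacian(N, np.mean(a)),                   # (21), arithmetic mean of samples
        "variational": 0.5 * P.T @ L @ P,                              # (23), R L P
    }
    return {name: two_grid_radius(L, L_H, P, h, gamma) for name, L_H in candidates.items()}

radii = compare_coarse_operators()
```

With the averaged operator the smoothest error mode is corrected by only about $\bar a / m(a^\epsilon) \approx 0.3$ of its size, and smoothing barely touches it, so its radius stays near 0.7 no matter how large $\gamma$ is. The variational choice makes the coarse grid correction an energy-orthogonal projection, so its radius can never exceed 1.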


Convergence Theory 

The theorem below on the convergence rate is too pessimistic about the number of smoothing iterations needed. However, the analysis still gives insight into the convergence process and the role of homogenization. With $L_H$ replaced by the averaged operator (21), the same analysis results in $\gamma = O(h^{-2})$, which means that multigrid does not improve the rate of convergence over just using Jacobi iterations. This follows from the effect of the oscillations on the lower eigenmodes. It should also be noted that in case (ii), the solution of $L_{\epsilon,h}$ is much closer to that of $L_H$; see [11].

Theorem 1. Let $M$ be defined as in (8), with $L_H$ defined by (11). There exists a constant $C$ such that

$$\rho(M) \le \rho_0 < 1$$

when either one of the following conditions is satisfied:

(i) $\gamma \ge C h^{-1-3/7}|\ln h|$;

(ii) the ratio of $h$ to $\epsilon$ belongs to the set of Diophantine numbers, and $\gamma \ge C h^{-1-1/5}|\ln h|$.

For details of the proof, see [10]. An outline is as follows. Separate the complete eigenspace of $L_{\epsilon,h}$ into two orthogonal subspaces: the space of low eigenmodes and that of high eigenmodes. After several Jacobi smoothing iterations on the fine grid level, the high eigenmodes of the error are reduced, and only the low eigenmodes are left. Combining eigenvalue analysis with homogenization theory [11], one can show that the low eigenmodes of the original discrete operator are close to those of the corresponding homogenized operator. We then approximate them by the corresponding homogenized eigenmodes and correct these on the coarse grid level.
Numerical Results


In Figures 3.1 and 3.2, $a^\epsilon(x) = 2.1 + 2\sin(2\pi x/\epsilon)$. We plot the graph analogous to Figure 1. Here $\epsilon = \sqrt{2}h$ and $\omega$ in (9) is 0.1829. In Figures 3.3 and 3.4, $a^\epsilon(x) = 2.1 + 2\sin(2\pi x/\epsilon + \pi/1)$. Here $\epsilon = 4h$ and $\omega = 0.1585$.






Figure 3: $\bar\rho$ as a function of $\gamma$. Dotted line is for (20), solid line for (21), dashed line for (22), and dash-dot line for (23). (3.1)-(3.4) are for different numbers of grid points N.


In Figure 4, under the assumptions of Figures 3.3-3.4, we plot $a^\epsilon(x)$ and the approximation of (18) under the choices of coefficients in Figure 3.

Figure 4: (4.1) and (4.2) are the graphs of $a^\epsilon(x)$, where * marks the discretized values. (4.2) is the solution. Dashed line is for (17). Dash-dot line is for $-m(a^\epsilon)\,\frac{d^2u}{dx^2} = 1$, and line with circles is for (18).
In Figure 5, we plot $\bar\rho$ as a function of the variable $a$, on which the coarse grid operator $L_H = L_H(a)$ depends. In (5.1), $a^\epsilon(x) = 2.1 + 2\sin(2\pi x/\epsilon)$, $\omega = 0.1829$, and $\epsilon = \sqrt{2}h$; in (5.2), $a^\epsilon(x) = 20.1$ if $x/\epsilon - [x/\epsilon] \in (0.7, 0.9)$ and $0.1$ otherwise, $\omega = 0.0373$, and $\epsilon = 4h$. Here $\gamma = 10$ and N = 256.

In Figure 6, we present the convergence $u^\epsilon \to u$ as $\epsilon \to 0$ by giving the numerical solutions of (16) and (17). Recall that our goal is to solve the oscillatory problem and to use the homogenized operator only for the coarse grids.

Figure 6: Solid lines are the approximations for (18) and dashed lines are the solutions for (17), with $\epsilon = 0.2$ in (6.1) and $\epsilon = 0.1$ in (6.2). Here N = 500.




HYPERBOLIC PROBLEMS 


Time evolution of a hyperbolic differential equation can be used for steady state computations. This is common in computational fluid dynamics [6]. In multigrid this means that hyperbolic time stepping replaces the smoothing step. There are fundamental differences from standard multigrid for elliptic problems, but some of our earlier discussion carries over to the hyperbolic case. The dissipative mechanisms for hyperbolic problems are mainly the boundary conditions. Consider using the model problem

$$\frac{\partial^2 u^\epsilon}{\partial t^2} - \frac{\partial}{\partial x}\left(a^\epsilon(x)\frac{\partial u^\epsilon}{\partial x}\right) = 1, \qquad 0 \le x \le 1, \qquad (24)$$

as the smoothing equation in multigrid for the numerical solution of (16), subject to the boundary conditions

$$u^\epsilon(0) = 0, \qquad \frac{du^\epsilon(1)}{dx} = 0. \qquad (25)$$

Equation (24) must have boundary conditions that are dissipative but reduce to (25) at steady state; see [5]:


$$\frac{\partial u^\epsilon(0,t)}{\partial t} = 0, \qquad \frac{\partial u^\epsilon(1,t)}{\partial t} + \sqrt{a^\epsilon(1)}\,\frac{\partial u^\epsilon(1,t)}{\partial x} = 0. \qquad (26)$$

The initial condition should support the transport of the residual to the dissipative boundary $x = 1$,

$$u^\epsilon(x,0) = u^0(x) \ \text{given}, \qquad u^\epsilon(x,\Delta t) = u^0(x) - \Delta t\,\sqrt{a^\epsilon(x)}\,D_+u^0(x).$$

Note that the initial condition approximates the transport equation $u_t + \sqrt{a}\,u_x = 0$. The difference approximation of (24) needs a low level of numerical dissipation.

The homogenization theory of [2] is also valid for equations of the type (24). A numerical indication is seen in Figure 7.

The positive effect of multigrid on the convergence rate does not carry over to problems for which the steady state is hyperbolic or contains hyperbolic components. If

$$\frac{\partial u^\epsilon}{\partial t} + \frac{\partial u^\epsilon}{\partial x} + a^\epsilon\frac{\partial u^\epsilon}{\partial y} = 0$$

is used for the equation

$$\frac{\partial u^\epsilon}{\partial x} + a^\epsilon\frac{\partial u^\epsilon}{\partial y} = 0, \qquad u^\epsilon \text{ 1-periodic in } y, \qquad u^\epsilon(0,y,t) = a^\epsilon(y),$$

then the coarse grid operator must resolve all scales of $a^\epsilon$ to the required accuracy in order to produce a multigrid speed-up. More on this phenomenon will be reported elsewhere.

Numerical Results 

In Figure 7, we take 50 smoothing steps. The coefficient $a(x/\epsilon)$ is the same as in Figure 1.






Figure 7: (7.1) Solutions: solid line is the steady-state solution; dashed line, the homogenized solution; dash-dot line, the averaged solution. (7.2) Residual as a function of the number of two-level multigrid cycles. (7.3) Approximate solutions after each two-level cycle. (7.4) Approximate solutions for the time evolution equation.


CONCLUSION 

Elliptic equations and some hyperbolic equations with highly oscillatory coefficients have been studied. We have shown that the homogenized forms of the equations are very useful in the design of coarse grid operators for multigrid.

The evidence comes from a sequence of numerical examples with strongly variable coefficients and, to some extent, from theoretical analysis. The result is clear in the asymptotic regime of many smoothing iterations.

The impact of the numerical truncation error and of the interpolation operator on the coarse grid operator needs to be assessed in order to improve the performance in the regime of very few smoothing iterations per cycle.
SUMMARY


We will describe a finite difference code for computing equilibrium configurations of the order-parameter tensor field for nematic liquid crystals in rectangular regions by minimization of the Landau-de Gennes free energy functional. The implementation of the free energy functional described here includes magnetic fields, quadratic gradient terms, and scalar bulk terms through fourth order. Boundary conditions include the effects of strong surface anchoring. The target architectures for our implementation are SIMD machines with interconnection networks that can be configured as 2- or 3-dimensional grids, such as the Wavetracer DTC. We also discuss the relative efficiency of a number of iterative methods for the solution of the linear systems arising from this discretization on such architectures.

INTRODUCTION: LIQUID CRYSTALS 


Liquid crystal based technology plays a key role in many devices, such as digital watches and
calculators, active and passive matrix liquid crystal displays in laptop computers, switchable
windows using Polymer Dispersed Liquid Crystals (PDLCs), thermometers, temperature-sensitive
films, and materials such as Kevlar, which employ high-strength liquid crystal polymers. In addition,
they are likely to play a key role in developments such as High Definition Television (HDTV) and
optical communications and computing.

Liquid crystals are so called because they exhibit some of the properties of both the liquid and
crystalline states. In fact, they are substances which, over certain ranges of temperature, can exist
in one or more mesophases somewhere between the rigid lattices of crystalline solids, which exhibit
both orientational and positional order, and the isotropic liquid phase, which exhibits neither.
Liquid crystals resemble liquids in that their molecules are free to flow and thus can assume the
shape of a containment vessel. On the other hand, they exhibit orientational and possibly some
positional order. This is due to the intermolecular forces which are stronger than those in liquids

This work was supported in part by the Advanced Liquid Crystalline Optical Materials (ALCOM) Science and 
Technology Center at Kent State University under DMR89-20147. 

a This author was supported in part by The Research Council of Kent State University. 
and which cause the molecules to have, on average, a preferred direction. Liquid crystals may exist
in a number of mesophases, such as the nematic, smectic, and cholesteric phases (see [3]).

In this paper we shall confine ourselves to nematic liquid crystals, which exhibit orientational but
no positional order. We wish to study the orientational order and the intermolecular forces that are
present in a nematic liquid crystal material. To do this we need a quantitative measure of the
degree of order and of the total free energy (sum of intermolecular forces) in the system. A typical
liquid crystal molecule is long, rod-like and rigid. Its direction in space is given by the unit vector
$\mathbf{n} = (n_1, n_2, n_3)$. The molecule points in the $\mathbf{n}$ or $-\mathbf{n}$ direction with equal probability; therefore,
there is no up or down direction. The director $\hat{\mathbf{n}} = (\hat{n}_1, \hat{n}_2, \hat{n}_3)$ is also a unit vector, showing the
preferred average direction of the molecules at a point in the sample. The degree of order of a liquid
crystal material at a particular point in the sample can be measured in terms of the statistical
average of the angles $\theta$ which the molecules make with the director. A more common measure is
$S := \langle 3\cos^2\theta - 1 \rangle / 2$, where $\langle \cdot \rangle$ denotes a thermodynamic or temporal average. A value close
to 1 indicates strong ordering of the molecules, as is present in a crystalline solid. Values near zero
indicate random ordering, such as exists in an isotropic liquid. The order parameter S depends on
the temperature T.
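The definition of S can be checked numerically. The sketch below is a minimal illustration, assuming the thermodynamic or temporal average is replaced by a plain sample mean over a set of molecular angles; the function name `scalar_order_parameter` is hypothetical. It recovers S = 1 for perfect alignment, S = -1/2 for molecules perpendicular to the director, and S close to 0 for randomly oriented molecules.

```python
import numpy as np

def scalar_order_parameter(theta):
    """S = <3 cos^2(theta) - 1> / 2, with <.> approximated by a sample mean.

    theta: angles (radians) between individual molecules and the director.
    """
    theta = np.asarray(theta, dtype=float)
    return 0.5 * np.mean(3.0 * np.cos(theta) ** 2 - 1.0)

# All molecules parallel to the director (perfect orientational order).
print(scalar_order_parameter(np.zeros(1000)))  # 1.0

# Orientations uniform on the sphere: cos(theta) is uniform on [-1, 1].
rng = np.random.default_rng(0)
theta_random = np.arccos(rng.uniform(-1.0, 1.0, 200_000))
print(scalar_order_parameter(theta_random))  # close to 0 (isotropic liquid)
```

Sampling cos(θ) uniformly, rather than θ itself, is what makes the random orientations uniform on the sphere; sampling θ uniformly would bias the result.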

Most early theoretical and computational results on liquid crystals employed the Oseen-Frank 
theory. This assumes that the degree of order S is uniform throughout the material and seeks to 
calculate the equilibrium configuration of the material by obtaining the director field which 
minimizes the free energy functional 

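$$F(\hat{\mathbf{n}}) := \frac{1}{2}\int_\Omega \left\{ K_1 (\nabla\cdot\hat{\mathbf{n}})^2 + K_2 (\hat{\mathbf{n}}\cdot\nabla\times\hat{\mathbf{n}})^2 + K_3 |\hat{\mathbf{n}}\times\nabla\times\hat{\mathbf{n}}|^2 \right\}.$$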

In an infinite bulk the preferred configuration for the director field is one of uniform parallel 
alignment. This will not normally be the case in practice, however, due to the effects of boundaries 
and external fields. This theory, while instrumental in predicting many important phenomena in 
liquid crystal physics, has some deficiencies. In particular, it is inadequate to model behavior close 
to a defect, where the order may not be uniform and the director may not be well defined. For 
example, in the presence of a radial field about a line defect this theory will exhibit a singularity at 
the core. For this reason there is increased emphasis on the more computationally complex 
Landau-de Gennes formulation. 

THE LANDAU-DE GENNES FORMULATION


The Landau-de Gennes formulation describes nematic liquid crystals by a 3 x 3 symmetric, traceless 
tensor order parameter Q. The local orientational information is given by the eigenvectors and 
eigenvalues of Q at each point. Several behaviors can be distinguished by considering the relative 
magnitudes of the eigenvalues. The material is said to be uniaxial if Q has a unique largest 
eigenvalue, with the two other eigenvalues equal to minus half the largest one. The corresponding 
eigenvector gives the locally preferred direction. Thus this is the case which can be represented by
the Oseen-Frank theory, and in fact in this case Q can be represented in the form

$$Q = \frac{1}{2} S \left(3\hat{\mathbf{n}}\hat{\mathbf{n}}^T - I\right)$$
where S is the value of the maximum eigenvalue and $\hat{\mathbf{n}}$ is the normalized eigenvector associated with
it. The Landau-de Gennes formulation, however, is capable of representing more complex behaviors,
such as the biaxial case, where all three eigenvalues are distinct, and the isotropic case, where all
three eigenvalues are equal and hence, because Q is traceless, all three are 0.
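The uniaxial representation and the eigenvalue-based classification can be sketched in a few lines. This is an illustration only; the helper names are hypothetical, and, as a labelled extension of the definition in the text, the classifier also counts a unique smallest eigenvalue (negative S) as uniaxial, since it only checks whether exactly two eigenvalues coincide.

```python
import numpy as np

def uniaxial_Q(S, n):
    """Uniaxial tensor Q = (S/2)(3 n n^T - I) for a director n (normalised here)."""
    n = np.asarray(n, dtype=float)
    n = n / np.linalg.norm(n)
    return 0.5 * S * (3.0 * np.outer(n, n) - np.eye(3))

def classify(Q, tol=1e-10):
    """Classify a symmetric, traceless Q by the multiplicity of its eigenvalues."""
    lam = np.linalg.eigvalsh(Q)  # real eigenvalues in ascending order
    if np.all(np.abs(lam) < tol):
        return "isotropic"  # all three equal, hence all zero
    if abs(lam[0] - lam[1]) < tol or abs(lam[1] - lam[2]) < tol:
        return "uniaxial"   # exactly two eigenvalues coincide
    return "biaxial"        # three distinct eigenvalues

Q = uniaxial_Q(0.6, [0.0, 0.0, 1.0])
print(np.trace(Q), np.linalg.eigvalsh(Q), classify(Q))
print(classify(np.diag([0.5, -0.1, -0.4])), classify(np.zeros((3, 3))))  # biaxial isotropic
```

For the uniaxial example the eigenvalues come out as S and -S/2 (twice), and the director is the eigenvector for S, matching the description above.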

To obtain the equilibrium tensor field we again seek a tensor field Q that minimizes the free energy of
the system. In this case, the free energy can be expressed as

$$F(Q) = F_{vol}(Q) + F_{surf}(Q) = \int_\Omega f_{vol}(Q) + \int_{\partial\Omega} f_{surf}(Q),$$

where Ω and ∂Ω represent the interior and surface of the slab, respectively. In this implementation
we limit ourselves to strong anchoring on the surface of Ω.

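The term $f_{vol}(Q)$, whose integral $F_{vol}(Q)$ approximates the interior free energy, is given by the following
expression (see, for instance, [18]):

$$\begin{aligned} f_{vol}(Q) = {} & \frac{1}{2}L_1 Q_{\alpha\beta,\gamma}Q_{\alpha\beta,\gamma} + \frac{1}{2}L_2 Q_{\alpha\beta,\beta}Q_{\alpha\gamma,\gamma} + \frac{1}{2}L_3 Q_{\alpha\beta,\gamma}Q_{\alpha\gamma,\beta} + \frac{1}{2}A\,\mathrm{trace}(Q^2) \\ & - \frac{1}{3}B\,\mathrm{trace}(Q^3) + \frac{1}{4}C\,\mathrm{trace}(Q^2)^2 + \frac{1}{5}D\,\mathrm{trace}(Q^2)\,\mathrm{trace}(Q^3) \\ & + \frac{1}{6}M\,\mathrm{trace}(Q^2)^3 + \frac{1}{6}M'\,\mathrm{trace}(Q^3)^2 - \Delta\chi_{max} H_\alpha Q_{\alpha\beta} H_\beta \end{aligned} \qquad (1)$$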

where $L_1$, $L_2$, and $L_3$ are elastic constants, A, B, C, D, M, and M' are bulk constants, and H,
$\Delta\chi_{max}$, and E are the field terms and constants associated with the magnetic field, respectively. The
convention is used that summation over repeated indices is implied and that indices separated
by commas represent partial derivatives. The surface free energy density $f_{surf}$ has the form

$$f_{surf}(Q) := \frac{1}{2} V\,\mathrm{trace}\left((Q - Q_0)^2\right) \qquad (2)$$

where $Q_0$ is a tensor associated with the type of anchoring of the surface elements and V is a
prescribed constant. In the strong anchoring case presented here, Q cannot vary from $Q_0$ and hence
$\partial_Q f_{surf}(Q) = 0$.
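The scalar bulk terms of (1) and the surface term (2) are easy to evaluate for a given tensor. The sketch below is illustrative only: it omits the elastic and magnetic terms, the helper names are hypothetical, and the 1/6 weights on M and M' follow the reconstructed form of (1) and should be treated as an assumption. For a uniaxial Q, trace(Q²) = 3S²/2 and trace(Q³) = 3S³/4, so the A, B, C terms reduce to the familiar uniaxial Landau form, which serves as a check.

```python
import numpy as np

def bulk_density(Q, A, B, C, D=0.0, M=0.0, Mp=0.0):
    """Scalar bulk terms of f_vol in (1); Mp stands for M'.

    The 1/6 weights on the M and M' terms are an assumption.
    """
    t2 = np.trace(Q @ Q)
    t3 = np.trace(Q @ Q @ Q)
    return (A / 2 * t2 - B / 3 * t3 + C / 4 * t2**2
            + D / 5 * t2 * t3 + M / 6 * t2**3 + Mp / 6 * t3**2)

def surface_density(Q, Q0, V):
    """Surface term (2): (V/2) trace((Q - Q0)^2)."""
    d = Q - Q0
    return 0.5 * V * np.trace(d @ d)

def uniaxial_Q(S, n):
    """Q = (S/2)(3 n n^T - I) for a unit director n."""
    n = np.asarray(n, dtype=float) / np.linalg.norm(n)
    return 0.5 * S * (3.0 * np.outer(n, n) - np.eye(3))

S, A, B, C = 0.4, 1.0, 0.5, 2.0
Q = uniaxial_Q(S, [0.0, 0.0, 1.0])
# Uniaxial reduction of the A, B, C terms: (3/4)A S^2 - (1/4)B S^3 + (9/16)C S^4.
closed_form = 0.75 * A * S**2 - 0.25 * B * S**3 + 9 / 16 * C * S**4
print(bulk_density(Q, A, B, C), closed_form)
print(surface_density(Q, Q, V=1.0))  # strong anchoring, Q = Q0: 0.0
```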

For P ∈ Ω, the tensor Q(P) will be represented in the form

$$Q(P) = q_1(P)\phi_1 + q_2(P)\phi_2 + q_3(P)\phi_3 + q_4(P)\phi_4 + q_5(P)\phi_5,$$

similar to that in Gartland [12], where $\{q_\ell(P)\}_{\ell=1}^{5}$ are real-valued functions on Ω and $\phi_1, \ldots, \phi_5$ are
fixed symmetric, traceless 3 × 3 matrices.
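The particular basis used by Gartland is not reproduced here, so the following sketch assumes one common choice: an orthonormal basis (in the Frobenius inner product) of the 5-dimensional space of symmetric traceless 3 × 3 matrices. With an orthonormal basis the coefficients are recovered as q_ℓ = trace(φ_ℓ Q). The names `PHI`, `q_to_Q` and `Q_to_q` are hypothetical; both maps also work on whole grid fields of shape (..., 5) or (..., 3, 3).

```python
import numpy as np

S2, S6 = np.sqrt(2.0), np.sqrt(6.0)
E = np.eye(3)

def sym(a, b):
    """Normalised symmetric off-diagonal basis matrix (e_a e_b^T + e_b e_a^T)/sqrt(2)."""
    return (np.outer(E[a], E[b]) + np.outer(E[b], E[a])) / S2

# Assumed orthonormal basis phi_1..phi_5 of symmetric traceless 3x3 matrices.
PHI = np.array([
    np.diag([-1.0, -1.0, 2.0]) / S6,
    np.diag([1.0, -1.0, 0.0]) / S2,
    sym(0, 1), sym(0, 2), sym(1, 2),
])

def q_to_Q(q):
    """Q = sum_l q_l phi_l; q may have shape (5,) or (..., 5)."""
    return np.tensordot(q, PHI, axes=1)

def Q_to_q(Q):
    """q_l = trace(phi_l Q) (phi_l symmetric); Q may have shape (3, 3) or (..., 3, 3)."""
    return np.einsum("kab,...ab->...k", PHI, Q)

q = np.array([0.3, -0.1, 0.2, 0.05, -0.4])
Q = q_to_Q(q)
print(np.trace(Q), np.allclose(Q, Q.T), Q_to_q(Q))
```

Storing five numbers per grid point instead of nine is what allows each virtual processor to hold exactly the five unknowns of its node.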
THE PHYSICAL PROBLEM


A finite difference discretization of the full slab problem, approximating the equilibrium
configuration of liquid crystals in the slab

$$\Omega = \{(x, y, z) : 0 < x < a,\ 0 < y < b,\ 0 < z < c\},$$

is given in [7]. In this paper we shall confine our consideration to the case of an infinite slab.
Assuming the slab is infinite in the z-direction and imposing boundary conditions which do not
vary with z effectively reduces the problem to a two-dimensional problem on the rectangle

$$\Omega := \{(x, y) : 0 < x < a,\ 0 < y < b\}.$$

The region is discretized in the standard manner by dividing the rectangle Ω into I × J regions

$$v(i,j) = \{(x, y) : i\Delta x \le x \le (i+1)\Delta x,\ j\Delta y \le y \le (j+1)\Delta y\}$$

for 0 ≤ i ≤ I − 1 and 0 ≤ j ≤ J − 1, where Δx = a/I and Δy = b/J.


The discrete interior free energy integral is now represented by

$$\int_\Omega f_{vol}(Q) \approx \sum_{i,j} f_{vol}(Q(x_i, y_j)) \times \mathrm{volume}(v(i,j)), \qquad (3)$$

where the points P = (x_i, y_j), with x_i = iΔx and y_j = jΔy, are located in the lower left-hand corner
of the rectangle v(i,j). The derivatives with respect to x and y in (1) are approximated using
central difference approximations.
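As an illustration of (3), the sketch below evaluates only the $L_1$ elastic term of (1), using central differences in x and y (there is no z-dependence) and weighting each node by the cell area ΔxΔy. It is a minimal sketch under stated assumptions: the tensor field is stored as an array of shape (I+1, J+1, 3, 3), the function name is hypothetical, and, because central differences need both neighbours, the sum runs over interior nodes only, with the boundary nodes held fixed by strong anchoring.

```python
import numpy as np

def discrete_L1_energy(Q, dx, dy, L1):
    """Approximate sum over nodes of (1/2) L1 Q_ab,g Q_ab,g * dx * dy.

    Q: array of shape (I+1, J+1, 3, 3), the tensor at each grid point (x_i, y_j).
    Central differences are taken at the interior nodes 0 < i < I, 0 < j < J.
    """
    Qx = (Q[2:, 1:-1] - Q[:-2, 1:-1]) / (2.0 * dx)
    Qy = (Q[1:-1, 2:] - Q[1:-1, :-2]) / (2.0 * dy)
    density = 0.5 * L1 * (np.sum(Qx**2, axis=(-2, -1)) + np.sum(Qy**2, axis=(-2, -1)))
    return np.sum(density) * dx * dy  # each node represents one cell of area dx*dy

# Check on a field that is linear in x: Q(x, y) = x M, so Q_,x = M exactly.
I = J = 8
a = b = 1.0
dx, dy = a / I, b / J
M = np.diag([1.0, -0.5, -0.5])
x = np.linspace(0.0, a, I + 1)
Q_lin = np.einsum("i,j,ab->ijab", x, np.ones(J + 1), M)
energy = discrete_L1_energy(Q_lin, dx, dy, L1=2.0)
print(energy)  # (1/2)(2)(|M|^2 = 1.5)(49 interior nodes)(1/64) = 1.1484375
```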


With the assumption of strong anchoring, a second order accurate approximation of the Landau-de
Gennes free energy, given by

$$F(Q) \approx \sum_{i,j} f_{vol}(Q(x_i, y_j)) \times \mathrm{volume}(v(i,j)) = \sum_{i,j} h(x_i, y_j), \qquad (4)$$

is obtained. With the discretization (4), the problem is reduced to one of minimizing $\sum_{i,j} h(x_i, y_j)$
over all choices of $\{q_\ell(x_i, y_j)\}_{\ell=1}^{5}$. This unconstrained discrete minimization problem can be attacked
in the standard way; that is, we seek a solution of the non-linear system of equations

$$\frac{\partial}{\partial q_\ell(x_i, y_j)} \sum_{k,m} h(x_k, y_m) = 0 \qquad (5)$$

for 0 < i < I, 0 < j < J, and ℓ = 1, …, 5. A standard approach to solving non-linear systems such as
these is to use a modified Newton method (see [6]).

Each iteration of the modified Newton method involves solving a linear system whose matrix is the
Jacobian of (5), and then using that solution to update the iterate and the Jacobian, after which
the process is repeated. The system in question is large and symmetric, but for certain values
of the temperature it may become indefinite. In addition, (5) may be expected to have multiple
solutions, which may be either stable or unstable. The ultimate aim of this research is to track the
minimal energy states as the temperature varies and to model the resulting bifurcations and phase
transitions.


THE WAVETRACER DTC ARCHITECTURE 


The target architecture for this application is a massively parallel SIMD computer. A SIMD 
computer uses multiple synchronized processing elements that operate in a lock-step fashion to 
achieve parallelism. Each processing element (PE) performs the same operation at the same time on
its local data, which is stored either in its own local memory or in a shared memory. A control unit
(CU) broadcasts instructions to the processing elements for execution. Each PE can be either active 
or inactive during a particular operation. The control unit determines which PEs are to participate 
by means of a masking function that either turns a PE on or off. Only the selected processors 
execute the instruction, while the masked processors remain idle. The control unit normally buffers 
data and instructions that will be broadcast to the processor array. A front-end computer provides 
the programming environment along with the usual programming utilities such as a debugger and a 
compiler. Program code is compiled and separated into scalar and parallel instructions. Scalar 
operations are usually executed on the front-end, thus freeing the processor array to perform only 
parallel computations. This architecture is considerably simpler to implement and program than 
the alternative Multiple Instruction Multiple Data stream (MIMD) machines, in which each 
processor can execute a different instruction. The SIMD architecture is normally used for massively 
parallel machines, having between 4096 and 65536 processors, each with local memory, connected by 
a special purpose high-capacity communication network. Early examples of this architecture 
included the MASPAR MP-1 and MP-2 and the Thinking Machines Corporation Connection 
Machine CM-1 and CM-2. 

The platform chosen for this implementation was the Wavetracer Data Transport Computer (DTC), 
situated in the Department of Mathematics and Computer Science at Kent State. This has a 
number of unique features compared with previous SIMD computers. It was designed as a low cost 
massively parallel processor, which can deliver “super-computing” levels of performance at 
relatively low cost. Unlike previous SIMD machines, which had dedicated front-end processors for 
storing scalar data and performing uni-variable (scalar) computations, the DTC uses a standard 
workstation for this purpose as well as for compilation and storage of the program. Among 
front-ends supported were the Sun 3, Sparc and Hewlett-Packard/Apollo workstations. 

The DTC is connected to the front-end by means of the industry standard Small Computer System 
Interface (SCSI), which is normally used to connect hard disks. The maximum bandwidth of this 
interface is 5 Mbytes per second. The front-end sends instructions and data to a control unit, which 
decodes these instructions and broadcasts both instructions and data to the processor array. The 
array processors are semi-custom 1.5 micron standard cell chips. Each chip contains 32 one-bit 
processors together with 2 kilobits of fast RAM for each processor, and associated control and 
memory error-detection circuitry. In addition, each processor has access to between 8 and 32 
kilobytes of private external dynamic memory depending on the configuration. Each circuit board 
consists of 128 chips. The minimal configuration, the DTC-4, has one circuit board and thus 4096 
processors. Other configurations are the DTC-8, with 2 circuit boards and 8192 processors, and the 
DTC-16, with 4 circuit boards and 16384 processors. 
The processors of the DTC-4 can be configured either as a 16 × 16 × 16 cube, for three-dimensional
applications, or as a 64 × 64 square, for two-dimensional applications. The DTC-8 can be configured
as a 16 × 32 × 16 cube or a 64 × 128 square, and the DTC-16 as a 32 × 32 × 16 cube or a 128 × 128
square. The assumption here is that most applications correspond to physical problems in 2 or 3
dimensions, and thus a 2- or 3-dimensional interconnection network is the most efficient for their
solution. This is in contrast to the Connection Machine, in which the processors are connected by a
hypercube network.
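The processor counts quoted above are mutually consistent, which a few lines of arithmetic confirm: 32 processors per chip times 128 chips per board gives 4096 processors per board, and each listed cube and square shape contains exactly the number of processors of its configuration. The dictionary layout below is simply a convenient way to tabulate the figures from the text.

```python
# Processor counts implied by the text: 32 processors per chip, 128 chips per
# circuit board, and 1, 2 or 4 boards in the DTC-4, DTC-8 and DTC-16.
PROCS_PER_CHIP = 32
CHIPS_PER_BOARD = 128

CONFIGS = {  # name: (boards, 3-D cube shape, 2-D square shape)
    "DTC-4": (1, (16, 16, 16), (64, 64)),
    "DTC-8": (2, (16, 32, 16), (64, 128)),
    "DTC-16": (4, (32, 32, 16), (128, 128)),
}

for name, (boards, cube, square) in CONFIGS.items():
    total = boards * CHIPS_PER_BOARD * PROCS_PER_CHIP
    in_cube = cube[0] * cube[1] * cube[2]
    in_square = square[0] * square[1]
    assert total == in_cube == in_square, name
    print(f"{name}: {total} processors = {cube} cube = {square} square")
```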

There are a number of factors which affect the DTC's performance. First, the speed of the
front end is a determining factor in the overall performance of the DTC, since all uni-variable
expressions are processed on the front end and all instructions are passed from the front end to
the control unit. Second, although the DTC provides efficient data movement along the grid, the
results of propagating data to the left, for example, are undefined at the right boundary nodes.
For problems with periodic boundary conditions it is desirable that the interconnection network
have wraparound, which allows values to be propagated from one boundary to the opposite one.
This is not provided, which also poses a problem for periodic geometries such as spherical or
ellipsoidal ones. Finally, there is no microsecond timer on the DTC, and all timings must be done
on the front end.
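Two of the behaviours described in this section can be emulated with NumPy arrays standing in for the processor grid: a masked broadcast instruction, where only active PEs change their data, and a nearest-neighbour shift without wraparound. This is a conceptual sketch, not Wavetracer MultiC code; the function names are hypothetical, NaN is used here to mark the undefined results, and `np.roll` is shown only to contrast the periodic (wraparound) shift that the DTC does not provide.

```python
import numpy as np

def broadcast(op, data, mask):
    """One broadcast instruction: active PEs (mask True) keep op(data); masked-out
    PEs keep their old values. (NumPy evaluates op everywhere and then discards
    the results of the masked PEs, whereas masked PEs on a SIMD array stay idle.)"""
    return np.where(mask, op(data), data)

def shift_left(grid):
    """Each PE receives the value of its right-hand neighbour. The rightmost
    column has no such neighbour (no wraparound), so its result is undefined,
    marked here with NaN."""
    out = np.full(grid.shape, np.nan)
    out[:, :-1] = grid[:, 1:]
    return out

grid = np.arange(12, dtype=float).reshape(3, 4)
interior = np.zeros(grid.shape, dtype=bool)
interior[1:-1, 1:-1] = True          # e.g. mask off the boundary processors
print(broadcast(lambda v: 2 * v, grid, interior))
print(shift_left(grid))
print(np.roll(grid, -1, axis=1))     # periodic shift with wraparound, for comparison
```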

The traditional mode of solution of problems on a SIMD machine involves assigning one processor 
of the array per node in the problem space. To provide the ability to consider problems with more 
nodes than are available in the array, the DTC provides the ability to partition the memory of each 
processor to provide a larger number of virtual processors. There must be the same number of 
virtual processors for each physical processor. The number of virtual processors per physical 
processor is called the virtual processor ratio. The controller automatically issues instructions to the 
array once for each partition. Thus the execution time may be expected to increase linearly with 
the virtual processor ratio. 

The Wavetracer used for the results presented here was a Wavetracer DTC-4 with a Sun 3/50 front
end. Current codes are being run on a Wavetracer DTC-16 with a Hewlett-Packard/Apollo 705
front end. For the minimization problem we are considering, each discretization point P of the slab
is associated with a virtual processor. Since the virtual processors are arranged in a rectangle or
cube, similar to the actual processors, this provides an entirely natural mapping of the domain onto
the rectangular grid of the DTC, provided an equal number of grid points are used in each direction.

At each point P of the slab the tensor order parameter Q is defined in terms of the 5 unknowns
$\{q_\ell(P)\}_{\ell=1}^{5}$. In our implementation, each set of 5 unknowns is stored in a single virtual
processor. Associated with each unknown $q_\ell(P)$ there is also a corresponding row of the Jacobian
matrix. The nonzero entries of that row are also stored in the memory of the processor
associated with P. Each nonzero entry in a row of the Jacobian associated with P also
corresponds to another virtual processor (which in turn corresponds to a discretization point) to
which the values of $\{q_\ell(P)\}_{\ell=1}^{5}$ at P must be communicated when the Jacobian matrix is updated.
The set of processors with which a given processor P must communicate in order to update its row
of the Jacobian is called the stencil of P. If the stencil of any processor is large, then updating
the Jacobian at each step of Newton's method will be expensive. Fortunately, the finite
difference approximation described here yields a relatively small and compact stencil. In the
problem discussed below, the stencil will at worst consist of the nine points which correspond to
processors at most two steps away from the given processor.


SOLUTION OF THE MINIMIZATION PROBLEM 


Non-linear iterations 


The minimization of the free energy can be carried out by solution of the corresponding discrete
Euler-Lagrange equations given by (5). These give rise to a coupled system of five non-linear elliptic
partial differential equations.

An alternative approach is to compute the Euler-Lagrange equations from $\int_\Omega f_{vol}(Q)$. Discretizing
these produces a system similar to (5). In this case central differences are used for the unidirectional
partial derivatives. Two alternative discretizations of the mixed derivatives, both having the same
accuracy, are considered. One produces a seven-point and the other a nine-point stencil at each
nodal point in the domain. Since nearest-neighbor communications are efficient on the Wavetracer's
mesh array of processors, the communication costs are minimal. A reduced model in which
$L_2 = L_3 = D = M = M' = 0$ was also considered. This is significantly less complex and gives rise to a
five-point scheme. Results for this case were considered in [7].
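The text does not give the two mixed-derivative formulas, so the following sketch assumes two standard second-order choices that produce the stencil sizes mentioned. Combined with the five-point stencil used for $u_{xx}$ and $u_{yy}$, the four-corner formula gives a nine-point stencil, while the second formula uses only the (+1, +1) and (−1, −1) diagonals and so gives a seven-point stencil. Both are exact for quadratic functions, and the test below checks second-order accuracy on a smooth function.

```python
import numpy as np

def uxy_nine_point(u, i, j, dx, dy):
    """u_xy from the four diagonal neighbours; with the five-point stencil for
    u_xx + u_yy this gives a nine-point stencil."""
    return (u[i + 1, j + 1] - u[i + 1, j - 1]
            - u[i - 1, j + 1] + u[i - 1, j - 1]) / (4.0 * dx * dy)

def uxy_seven_point(u, i, j, dx, dy):
    """u_xy from the node, its four axis neighbours and only the (+1,+1) and
    (-1,-1) diagonals, which gives a seven-point stencil overall."""
    return (u[i + 1, j + 1] - u[i + 1, j] - u[i, j + 1] + 2.0 * u[i, j]
            - u[i - 1, j] - u[i, j - 1] + u[i - 1, j - 1]) / (2.0 * dx * dy)

x = np.linspace(0.0, 1.0, 101)
y = np.linspace(0.0, 2.0, 101)
dx, dy = x[1] - x[0], y[1] - y[0]
X, Y = np.meshgrid(x, y, indexing="ij")
u = np.sin(X) * np.cos(Y)          # exact u_xy = -cos(x) sin(y)
i, j = 50, 30
exact = -np.cos(x[i]) * np.sin(y[j])
print(uxy_nine_point(u, i, j, dx, dy) - exact, uxy_seven_point(u, i, j, dx, dy) - exact)
```

The seven-point form trades the two omitted diagonal values for extra use of points already needed by the five-point Laplacian, which reduces the number of neighbours each processor must fetch.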

In all cases the resulting non-linear system of equations was solved using a (modified) inexact
Newton method. Let $G : \mathbb{R}^{5n} \to \mathbb{R}^{5n}$ be a function representing the discrete Euler-Lagrange
equations. There are a total of 5(I − 1)(J − 1) non-linear equations in this system. The function G
depends on the 5n unknowns

$$G(x) = G(q_1^1, \ldots, q_5^1, q_1^2, \ldots, q_5^n),$$

where n = (I − 1)(J − 1) is the number of nodal points. Let G'(x) be the Jacobian of the system of
equations. Newton-type methods require solving a large sparse linear system $G'(x_k)s_k = -G(x_k)$
and then updating the unknowns appropriately.


In theory, Newton's method requires the exact solution of the linear system at each Newton
iteration. Inexact Newton methods use some form of iterative procedure to solve the linear system
approximately. Several iterative techniques, such as SOR and multigrid, were tested on this problem
with varying success. Note that the matrix $G'(x_k)$ is singular at bifurcation and turning points
and can be indefinite near these points. This can cause convergence problems when solving the
inner linear system. It is well known that in the early stages of the Newton (outer) iteration
the linear system need not be solved to full accuracy, since $x_k$ is relatively far from the true
solution $x^*$. Thus only a few inner iterations of the linear solver need to be performed. In later
stages, the inner system will need to be solved more accurately. This is precisely the philosophy of
the inexact (modified) Newton method. A common criterion used to determine how many inner
iterations are needed is as follows. In the kth iteration, compute a value $\eta_k \in (0, 1)$ which is an
acceptable bound on the relative residual. Common choices for this are $\eta_k := \ldots$, $\eta_k := \ldots$, and
$\eta_k := \min\{\|G(x_k)\|, \ldots\}$. For these problems the second of the above choices proved on average to
give the best results. The update $s_k$ was then determined by

$$\frac{\|G(x_k) + G'(x_k)s_k\|}{\|G(x_k)\|} \le \eta_k. \qquad (6)$$

Expression (6) may be interpreted intuitively as indicating that one should iterate until the inner 
residual becomes “small” enough, then do an update. 
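The stopping rule (6) can be illustrated with a small inexact Newton loop. In this sketch the inner solver is conjugate gradients, stopped as soon as the relative inner residual falls below η_k; the forcing term η_k = min(‖G(x_k)‖, 1/(k+2)) is a hypothetical choice for illustration and is not claimed to be one of the choices used in the paper. The test system G(x) = Ax + x³ − b with a tridiagonal A is also hypothetical; its Jacobian A + diag(3x²) is symmetric positive definite, so CG applies.

```python
import numpy as np

def cg(A, rhs, rel_tol, maxiter=None):
    """Conjugate gradients for A s = rhs (A SPD), stopped once
    ||rhs - A s|| <= rel_tol * ||rhs||, i.e. the test (6) with rhs = -G(x_k)."""
    maxiter = maxiter or 10 * rhs.size
    s = np.zeros_like(rhs)
    r = rhs.copy()
    p = r.copy()
    rr = r @ r
    target = rel_tol * np.linalg.norm(rhs)
    for _ in range(maxiter):
        if np.sqrt(rr) <= target:
            break
        Ap = A @ p
        pAp = p @ Ap
        if pAp <= 0.0:  # breakdown, e.g. an indefinite Jacobian
            break
        alpha = rr / pAp
        s += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new
    return s

def inexact_newton(G, J, x0, tol=1e-10, max_outer=50):
    """Outer Newton loop; each correction is only solved to relative accuracy eta_k."""
    x = np.array(x0, dtype=float)
    for k in range(max_outer):
        g = G(x)
        gnorm = np.linalg.norm(g)
        if gnorm <= tol:
            return x, k
        eta = min(gnorm, 1.0 / (k + 2))  # hypothetical forcing term in (0, 1)
        x = x + cg(J(x), -g, eta)
    return x, max_outer

# Hypothetical test problem: G(x) = A x + x^3 - b.
n = 10
A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)

def G(x):
    return A @ x + x**3 - b

def J(x):
    return A + np.diag(3.0 * x**2)

x, its = inexact_newton(G, J, np.zeros(n))
print(its, np.linalg.norm(G(x)))
```

Early outer steps accept a loose inner solve (η_k up to 1/2), while η_k shrinks with ‖G(x_k)‖ near the solution, which is the behaviour described in the text.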


Linear System Solvers 


Several classical iterative schemes were used to solve the inner sparse linear system for q_1, …, q_5 at
each nodal point. Each method had certain advantages and disadvantages when used as a solver on
the Wavetracer. The following schemes were evaluated:


1. Multi-color SOR 

2. Nested (multilevel) multi-color SOR 

3. Preconditioned conjugate gradient 

4. Multigrid (V-cycle) 

5. Nested (multilevel) multigrid 


All were implemented both as point-iterative and as block-iterative methods, by blocking
q_1, …, q_5 at each nodal point. In the point-iterative methods one solves for each q_i sequentially,
using the best available values of the other unknowns. The block method involves solving a 5 × 5 dense
linear system at each node.
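The block variant can be sketched as block Gauss-Seidel in which each node's 5 × 5 diagonal block is solved exactly with the latest values at the other nodes. The test matrix below is hypothetical (six nodes on a line, five coupled unknowns per node, chosen block diagonally dominant so the iteration converges) and uses a dense matrix for clarity rather than the stencil storage described earlier.

```python
import numpy as np

def block_gauss_seidel(A, b, x, bs=5, sweeps=1):
    """Block Gauss-Seidel: for each node, solve its bs x bs diagonal block
    exactly, using the latest values of the unknowns at all other nodes."""
    for _ in range(sweeps):
        for k in range(b.size // bs):
            blk = slice(k * bs, (k + 1) * bs)
            # b_k minus the coupling to all other nodes m != k
            rhs = b[blk] - A[blk, :] @ x + A[blk, blk] @ x[blk]
            x[blk] = np.linalg.solve(A[blk, blk], rhs)
    return x

# Hypothetical test matrix: 6 nodes on a line, 5 coupled unknowns per node.
nodes, bs = 6, 5
T = 4.0 * np.eye(nodes) - np.eye(nodes, k=1) - np.eye(nodes, k=-1)
A = np.kron(T, np.eye(bs)) + np.kron(np.eye(nodes), 0.5 * np.ones((bs, bs)))
b = np.ones(nodes * bs)
x = block_gauss_seidel(A, b, np.zeros(nodes * bs), bs=bs, sweeps=30)
print(np.linalg.norm(A @ x - b))
```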

A multi-coloring scheme was used for the SOR iterations [17] in order to introduce parallelism into
the method. Recall that, with red-black ordering, the Gauss-Seidel method decomposes into two
Jacobi steps on the half-size systems resulting from the coloring. Unlike the original Gauss-Seidel
method, the Jacobi method is highly parallelizable. The multi-colored SOR produces similar
benefits. In the case of the reduced model with the five-point stencil only two colors were needed.
Results for this case are given in [7]. In the full model, three colors are required for the seven-point
stencil and four colors for the nine-point stencil. The parameter ω for the SOR method was chosen
as the optimal parameter for the simple Laplacian model, since the matrix in our linear system has
a structure similar to the Laplacian matrix. Numerical experimentation showed that this was a
good choice for our reduced model and gave good convergence results.
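The color counts can be checked directly: a coloring is usable for multi-colored SOR if no node shares its color with any node in its stencil, so that all nodes of one color can be updated simultaneously. The sketch assumes, as a labelled assumption, that the seven-point stencil uses the (+1, +1) and (−1, −1) diagonals; for the opposite diagonals, (i − j) mod 3 would play the same role. The coloring formulas are illustrative choices, not taken from the paper.

```python
import itertools

STENCILS = {
    "five-point": [(1, 0), (-1, 0), (0, 1), (0, -1)],
    "seven-point": [(1, 0), (-1, 0), (0, 1), (0, -1), (1, 1), (-1, -1)],
    "nine-point": [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)
                   if (di, dj) != (0, 0)],
}

COLORINGS = {  # stencil: (number of colours, colour of node (i, j))
    "five-point": (2, lambda i, j: (i + j) % 2),
    "seven-point": (3, lambda i, j: (i + j) % 3),
    "nine-point": (4, lambda i, j: 2 * (i % 2) + (j % 2)),
}

def is_valid(coloring, offsets, size=12):
    """True if no node shares its colour with any neighbour in the stencil."""
    for i, j in itertools.product(range(size), repeat=2):
        for di, dj in offsets:
            if coloring(i, j) == coloring(i + di, j + dj):
                return False
    return True

for name, offsets in STENCILS.items():
    ncolors, coloring = COLORINGS[name]
    print(name, ncolors, "colours, valid:", is_valid(coloring, offsets))
```

Red-black ordering fails for the seven- and nine-point stencils because the diagonal neighbour (1, 1) has the same parity as the node itself.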

Preconditioned conjugate gradient [17] was tried with several preconditioners, and the performance
of all was essentially similar. The results are presented here for symmetric multi-colored SSOR
[17], which is simple to implement and easily parallelizable.
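A compact sketch of preconditioned conjugate gradient with an SSOR preconditioner is given below. It assumes natural (lexicographic) ordering and dense triangular solves for clarity, whereas the implementation described above uses a multi-colored ordering for parallelism; the helper names and the 2-D Laplacian test matrix are illustrative.

```python
import numpy as np

def ssor_preconditioner(A, omega=1.0):
    """Return r -> M^{-1} r for the SSOR preconditioner
    M = omega/(2 - omega) * (D/omega + L) D^{-1} (D/omega + L^T), A = L + D + L^T."""
    D = np.diag(np.diag(A))
    lower = D / omega + np.tril(A, -1)
    upper = lower.T              # D/omega + L^T, since A is symmetric
    scale = (2.0 - omega) / omega

    def apply(r):
        y = np.linalg.solve(lower, r)                  # forward sweep
        return scale * np.linalg.solve(upper, D @ y)   # backward sweep
    return apply

def pcg(A, b, precond, tol=1e-10, maxiter=1000):
    """Preconditioned CG; returns the solution and the iteration count."""
    x = np.zeros_like(b)
    r = b.copy()
    z = precond(r)
    p = z.copy()
    rz = r @ z
    for k in range(maxiter):
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, k
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        z = precond(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, maxiter

# 5-point Laplacian on a 7 x 7 interior grid (unscaled).
m = 7
T = 2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
A = np.kron(np.eye(m), T) + np.kron(T, np.eye(m))
b = np.ones(m * m)
x_cg, it_cg = pcg(A, b, lambda r: r)
x_pcg, it_pcg = pcg(A, b, ssor_preconditioner(A, omega=1.2))
print("CG iterations:", it_cg, " SSOR-PCG iterations:", it_pcg)
```

Because M is symmetric positive definite for 0 < ω < 2, the preconditioned iteration keeps the conjugate-gradient structure; the forward and backward sweeps are the same operations as a symmetric SOR iteration.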

Multigrid methods [2, 14, 16] were also implemented for these problems. The multigrid
implementation discussed here uses a single V-cycle in the inner iteration for each Newton outer
iteration. Gauss-Seidel iteration is used as the relaxation method on the fine and intermediate
coarse grids. The Gauss-Seidel method was chosen over the SOR method for the fine and
intermediate grids because of its better smoothing property; that is, it eliminates the high-frequency
components of the error more quickly in the early iterations than the SOR iteration does. This is
important because only a few iterations are performed on these grids per cycle of the multigrid
algorithm. The relaxation parameters ν_1 and ν_2 (the numbers of pre- and post-smoothing sweeps)
were usually taken to be equal to 3. Multicolored SOR iteration was used to solve the problem on the
coarsest grid, which was usually taken to be of size n = 4. The problem was solved to the level of
the truncation error, usually with just a few iterations. The numerical simulations were mostly done
on the two-dimensional problem of size n = 64, meaning 65 grid points in both the x and y
directions. Some smaller and larger problems were also examined, but with the minimal
configuration of the Wavetracer, the DTC-4, available at the time, the n = 64 problem was the
largest that could be simulated for the full liquid crystal problem using the multigrid method.
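A sketch of one V-cycle with the parameters quoted above (ν_1 = ν_2 = 3, coarsest grid n = 4, finest grid n = 64) is shown below for the scalar Poisson model problem rather than the five-component liquid crystal system. Several details are assumptions not stated in the text: red-black Gauss-Seidel smoothing, full-weighting restriction, bilinear interpolation, and 50 Gauss-Seidel sweeps standing in for the multicolored SOR solve on the coarsest grid.

```python
import numpy as np

def smooth(u, f, h, sweeps):
    """Red-black Gauss-Seidel sweeps for -Δu = f (5-point stencil), in place."""
    n = u.shape[0] - 1
    i, j = np.meshgrid(np.arange(1, n), np.arange(1, n), indexing="ij")
    inner = u[1:-1, 1:-1]  # view into u
    for _ in range(sweeps):
        for color in (0, 1):
            mask = (i + j) % 2 == color
            gs = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:]
                         + h * h * f[1:-1, 1:-1])
            inner[mask] = gs[mask]
    return u

def residual(u, f, h):
    """r = f - (-Δ_h u) at interior points, zero on the boundary."""
    r = np.zeros_like(u)
    r[1:-1, 1:-1] = f[1:-1, 1:-1] - (4 * u[1:-1, 1:-1] - u[:-2, 1:-1] - u[2:, 1:-1]
                                     - u[1:-1, :-2] - u[1:-1, 2:]) / (h * h)
    return r

def restrict(r):
    """Full weighting onto the grid with twice the spacing."""
    rc = np.zeros((r.shape[0] // 2 + 1,) * 2)
    c = r[2:-1:2, 2:-1:2]
    edges = r[1:-2:2, 2:-1:2] + r[3::2, 2:-1:2] + r[2:-1:2, 1:-2:2] + r[2:-1:2, 3::2]
    corners = r[1:-2:2, 1:-2:2] + r[1:-2:2, 3::2] + r[3::2, 1:-2:2] + r[3::2, 3::2]
    rc[1:-1, 1:-1] = (4 * c + 2 * edges + corners) / 16
    return rc

def prolong(ec):
    """Bilinear interpolation from the coarse grid to the fine grid."""
    m = ec.shape[0] - 1
    ef = np.zeros((2 * m + 1,) * 2)
    ef[::2, ::2] = ec
    ef[1::2, ::2] = 0.5 * (ec[:-1, :] + ec[1:, :])
    ef[::2, 1::2] = 0.5 * (ec[:, :-1] + ec[:, 1:])
    ef[1::2, 1::2] = 0.25 * (ec[:-1, :-1] + ec[1:, :-1] + ec[:-1, 1:] + ec[1:, 1:])
    return ef

def v_cycle(u, f, h, nu1=3, nu2=3, coarsest=4):
    n = u.shape[0] - 1
    if n <= coarsest:
        return smooth(u, f, h, 50)  # stand-in for the coarsest-grid solve
    smooth(u, f, h, nu1)                                  # pre-smoothing
    rc = restrict(residual(u, f, h))
    ec = v_cycle(np.zeros_like(rc), rc, 2 * h, nu1, nu2, coarsest)
    u += prolong(ec)                                      # coarse-grid correction
    smooth(u, f, h, nu2)                                  # post-smoothing
    return u

n = 64  # 65 grid points per direction
h = 1.0 / n
x = np.linspace(0.0, 1.0, n + 1)
X, Y = np.meshgrid(x, x, indexing="ij")
u_true, f = X**2 * Y**2, -2.0 * (X**2 + Y**2)
u = u_true.copy()
u[1:-1, 1:-1] = 0.0
for cycle in range(12):
    v_cycle(u, f, h)
    print(cycle + 1, np.max(np.abs(residual(u, f, h))), np.max(np.abs(u - u_true)))
```

Each coarser level has half as many points per direction, which is why, with one processor per finest-grid point, most processors sit idle on the coarse levels.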

The implementation of the multigrid algorithm on the Wavetracer assigns a processor (virtual or
physical) to each grid point on the finest mesh, including the boundary grid points. The model
simulations all assume Dirichlet (strong anchoring) boundary conditions, so the boundary
processors are used mainly to store the boundary data. The Wavetracer implementation uses a
multi-array data structure to hold the values for each grid level. Because of the restriction in the
MultiC language that each multi-variable in the executing program must be of the same size, this
implementation was deemed to be the most efficient and easiest to implement. One problem with
this implementation is that many processors are idle when solving on the coarser grids. Multigrid
is thus not fully parallel in this implementation, because not all processors are being utilized.
Alternative variations have been proposed to overcome this problem. Data transfers between grids
are fast, since they are handled within processor memory and no communication between
processors is required. Communication is required when computing the weighted averages for the
restriction operation, but the actual transfer of data to the coarser grid is done entirely within
processor memory. Another drawback of this implementation of the multigrid method is evident
when one solves the n = 64 problem in two dimensions. The physical two-dimensional processor
grid on the Wavetracer contains 64 processors in each dimension, for a total of 4096 processors.
The n = 64 multigrid problem requires 65 mesh points in each of the x and y directions. This causes
the Wavetracer to operate in virtual memory mode. Since each physical processor must contain the
same number of virtual processors, many virtual processors will remain idle during the iterations,
resulting in a great loss in efficiency. In addition, since the memory available to each physical
processor is divided into two halves, one for each of its virtual processors, the maximum problem
size that can be solved is diminished. Naturally, the solution would be to define a slightly smaller
problem of n = 63 that would not have this difficulty. The problem then becomes one of how to
define the series of coarser grids. In our original definition of the coarse grids we let each grid size
be a power of two. This greatly simplifies the construction of the grids and provides the necessary
symmetry to allow us to assign processors to the different grid levels in the manner described.
Data transfer between grids is also extremely simple, since it is all handled within processor
memory. Defining the coarse grids in any other way would greatly complicate the programming
process and would require many more computations and inter-processor communications.

Another solution to this problem would be to use the boundary processors not only to store the
boundary data but also to take part in the iteration process. This means that each boundary
processor would now represent two grid points in the mesh instead of only one. This would solve
the virtual processing problem, because one would then need only 63 physical processors in each
direction for the n = 64 problem. However, another problem presents itself because of the SIMD
nature of the Wavetracer. In a SIMD environment each processor must perform exactly the same
operation as all the other processors, except on a different set of data. The boundary processors as
defined above would have to be treated separately from the interior processors because in the
communications stage of the algorithm they are not performing the same operation. An interior
processor must communicate with its four nearest neighbors in a five-point stencil scheme, whereas
a boundary processor would only have to communicate with a subset of its neighbors, since the
boundary data that it needs for its update are stored in its own memory. In the two-dimensional
mesh the processors on each of the four edges of the grid must be treated separately, as must the
four corner processors. In a naive implementation these sets of processors would be handled
sequentially in the iteration process, greatly slowing down the computations. In fact, if the obvious
choice is made, the nine groups (the interior, the four edges, and the four corners) are updated one
after another, which could increase the update time by a factor of nine; this is considerably more
than the increase incurred by virtual processing. Unfortunately, this problem is not so easily
avoided when one considers general boundary conditions rather than Dirichlet conditions.

Another alternative approach is to use a black box multigrid method similar to that in Dendy [4, 5].
This eliminates the restriction that the number of grid points in each direction on the finest grid be
2^k + 1 for some k. In addition, by storing the interpolation operators explicitly, it allows the
incorporation of the boundary conditions, for example, by using extrapolation at the points closest
to the boundaries. Thus the boundary conditions are incorporated algebraically rather than by using
the difference equations directly. This does involve extra storage and, in the SIMD case, a loss of
parallelism due to grid-point-dependent code. However, judicious coding involving initialized
multipliers can reduce the latter effect at the expense of some further storage. There is reason to
believe that, for most problems of this type and most geometries, the increased storage will be less
than 100%, and thus that a code of this type will consume less storage overhead than one involving
virtual processing.


The philosophy behind the nested or multilevel schemes [1, 13] is as follows. The problem is solved
on a coarse grid to a certain precision. The results are then interpolated to a finer grid and used as
initial values for the solution process there. A sequence of successively finer grids is used, the
finest being the one on which the result is required. It is hoped that providing good initial guesses
will reduce the amount of work needed to obtain the desired accuracy on the finer grids. This effect
is observed in the numerical simulations. The multilevel methods suffer from the same kinds of
problems as the multigrid iterations when implemented on the Wavetracer. The different levels are
implemented using a multi-variable array (in the MultiC language) with the physical (or virtual)
processors assigned to the grid points on the finest grid level. This means that when one iterates on
the coarsest level, many virtual processors will be idle. The interpolation of results between grids is
fast because it is all done within the processor and no inter-processor communication is necessary.
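The benefit of nested iteration can be demonstrated on the Laplacian model problem (7)-(9). The sketch below is an illustration under stated assumptions, not the MultiC implementation: red-black SOR with the Laplacian-optimal ω is the one-level solver, bilinear interpolation carries each solution to the next finer grid (the exact Dirichlet data are then restored on the boundary), and each level is iterated until the maximum error against the known solution falls below a tolerance. It compares the number of finest-grid sweeps with and without the coarse-grid starting values.

```python
import numpy as np

def model(n):
    """True solution (8) and right-hand side (9) on an (n+1) x (n+1) grid."""
    x = np.linspace(0.0, 1.0, n + 1)
    X, Y = np.meshgrid(x, x, indexing="ij")
    return X**2 * Y**2, -2.0 * (X**2 + Y**2)

def rb_sor_sweep(u, f, h, omega):
    """One red-black SOR sweep for -Δu = f (5-point stencil), in place."""
    n = u.shape[0] - 1
    i, j = np.meshgrid(np.arange(1, n), np.arange(1, n), indexing="ij")
    inner = u[1:-1, 1:-1]  # view into u
    for color in (0, 1):
        mask = (i + j) % 2 == color
        gs = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:]
                     + h * h * f[1:-1, 1:-1])
        inner[mask] += omega * (gs[mask] - inner[mask])

def sweeps_to_tol(u, n, tol, max_sweeps=10_000):
    """Sweep until the max error against the known solution is <= tol."""
    u_true, f = model(n)
    h = 1.0 / n
    omega = 2.0 / (1.0 + np.sin(np.pi * h))  # optimal SOR parameter for the Laplacian
    sweeps = 0
    while np.max(np.abs(u - u_true)) > tol and sweeps < max_sweeps:
        rb_sor_sweep(u, f, h, omega)
        sweeps += 1
    return sweeps

def interpolate(uc):
    """Bilinear interpolation of a grid function to the next finer grid."""
    m = uc.shape[0] - 1
    uf = np.zeros((2 * m + 1, 2 * m + 1))
    uf[::2, ::2] = uc
    uf[1::2, ::2] = 0.5 * (uc[:-1, :] + uc[1:, :])
    uf[::2, 1::2] = 0.5 * (uc[:, :-1] + uc[:, 1:])
    uf[1::2, 1::2] = 0.25 * (uc[:-1, :-1] + uc[1:, :-1] + uc[:-1, 1:] + uc[1:, 1:])
    return uf

def initial_guess(n):
    """Dirichlet data from the true solution, zero in the interior."""
    u, _ = model(n)
    u[1:-1, 1:-1] = 0.0
    return u

def nested(n_coarse, n_fine, tol):
    n, u = n_coarse, initial_guess(n_coarse)
    sweeps = sweeps_to_tol(u, n, tol)
    while n < n_fine:
        n *= 2
        u = interpolate(u)
        u_true, _ = model(n)
        u[[0, -1], :] = u_true[[0, -1], :]  # exact Dirichlet data on the new grid
        u[:, [0, -1]] = u_true[:, [0, -1]]
        sweeps = sweeps_to_tol(u, n, tol)
    return u, sweeps  # sweeps performed on the finest grid only

tol, n_fine = 1e-6, 32
u_direct = initial_guess(n_fine)
direct = sweeps_to_tol(u_direct, n_fine, tol)
u_nested, nested_fine = nested(4, n_fine, tol)
print("fine-grid sweeps: direct", direct, "nested", nested_fine)
```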
NUMERICAL RESULTS


Laplacian and Scalar Liquid Crystal Problem 


Laplacian in Two Dimensions 

The model Laplacian problem in two dimensions is given by

$$-u_{xx} - u_{yy} = f(x, y) \ \text{in}\ \Omega, \qquad u = g(x, y) \ \text{on the boundary of}\ \Omega. \qquad (7)$$

Dirichlet boundary conditions are assumed and Ω is taken to be the unit square. The performance
of the various iterative methods previously discussed is compared for problems of size n = 63 for
the one-level schemes and n = 64 for the methods using more than one level. For these simulations
we also assume a known true solution given by

$$u = x^2 y^2, \qquad (8)$$

which makes the right-hand side of equation (7)

$$f(x, y) = -2(x^2 + y^2). \qquad (9)$$

With this known solution one can compute the error as well as the residual after each iteration in
order to observe the convergence. The boundary values are set to the known true solution, and an
initial guess of u = 0.0 is used at all interior grid points to start the iterations. At each iteration the
maximum absolute error and residual (infinity norms) calculated over all interior grid points are
monitored.

The Wavetracer DTC does not itself contain a microsecond timer. Consequently, all timings must
be performed on the Sun 3/50 front end. The columns real, user and syst give the real (wall-clock)
time, the time spent executing user code on the front end, and the time spent in system tasks
related to the program, including input/output. The input/output time includes time spent accessing
the SCSI bus and thus time spent sending instructions from the front-end processor to the sequencer
of the Wavetracer. User time includes time spent executing the sequential parts of the program. The
majority of the remaining real time is time elapsed while the DTC is executing parallel instructions.

The results of these simulations are given in Table 1. Given the initial guess u = 0.0, the maximum 
initial error is 1.0 and the maximum initial residuals are approximately 7934 and 7684 for the 
n = 64 and n = 63 size problems, respectively. The iterations are continued until the maximum 
absolute error is reduced by about a factor of $10^5$. A red-black scheme is implemented for all the
iterations (except Jacobi) to induce parallelism into the methods. The red-black coloring scheme is 
appropriate since the model Laplacian problem uses a 5-point stencil for processor communications. 
The iterations are done on the Wavetracer using a 64 x 64 physical two-dimensional grid of 
processors. 
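The initial residuals quoted above can be reproduced directly. The sketch assumes $h = 1/n$ (so that n = 63 gives 64 grid points per side, matching the 64 x 64 processor array), the 5-point operator $A_h u = (4u_{i,j} - S_{i,j})/h^2$ with $S_{i,j}$ the sum of the four neighbours, and $r = f - A_h u$; these conventions are our reading of the text, and the function name is hypothetical. The maximum occurs next to the corner (1, 1), where two neighbours carry boundary values close to 1.

```python
def max_initial_residual(n):
    """Max |f - A_h u| over interior points for the model problem (7)-(9), with
    u = 0 at interior points and u = x^2 y^2 on the boundary; h = 1/n."""
    h = 1.0 / n

    def u0(i, j):
        if i in (0, n) or j in (0, n):
            return (i * h) ** 2 * (j * h) ** 2   # exact boundary values
        return 0.0                               # zero initial guess

    worst = 0.0
    for i in range(1, n):
        for j in range(1, n):
            x, y = i * h, j * h
            f = -2.0 * (x * x + y * y)
            s = u0(i - 1, j) + u0(i + 1, j) + u0(i, j - 1) + u0(i, j + 1)
            # A_h u = (4 u_ij - S_ij) / h^2 and u_ij = 0 here, so r = f + S / h^2
            worst = max(worst, abs(f + s / (h * h)))
    return worst


print(round(max_initial_residual(64), 1), round(max_initial_residual(63), 1))
```

It gives roughly 7934 for n = 64 and 7684 for n = 63, in agreement with the values quoted above.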
Table 1. Timings for the Model Laplacian Problem on the Wavetracer DTC

                        real    user    syst    max. residual    max. error    iterations
Jacobi (n=64)          150.8     9.9    67.0       1.5(-3)         3.5(-5)         6901
Jacobi (n=63)          132.6    13.5    78.4       2.1(-3)         3.5(-5)         6689
Gauss-Seidel (n=64)     89.1     7.5    40.5       1.5(-3)         3.5(-5)         3451
Gauss-Seidel (n=63)     67.3     8.2    39.3       1.5(-3)         3.5(-5)         3345
SOR (n=64)               3.2     0.4     1.5       6.2(-2)         3.4(-5)          113
SOR (n=63)               3.2     0.4     2.0       5.8(-2)         3.5(-5)          111
Pre-cg (n=64)            6.1     0.5     1.0       7.2(-2)         2.5(-5)           32
Pre-cg (n=63)            2.7     0.5     1.0       6.8(-2)         3.1(-5)           31
Multigrid (n=64)         2.5     0.3     0.5       3.4(-2)         3.5(-5)   3 V-cycles


As expected, the Jacobi iteration is the slowest to converge. Even though it is completely 
parallelizable on the Wavetracer, its slow rate of convergence does not make it competitive. The 
Gauss-Seidel method converges in about half as many iterations as the Jacobi method. This is 
expected for the model Laplacian problem. Since the Gauss-Seidel iterations are implemented in a 
red-black ordering, each iteration takes slightly longer than a Jacobi iteration. For both the Jacobi 
and Gauss-Seidel iterations the real running times for the n = 63 size problem are shorter than those
for the n = 64 problem. This is because the n = 64 problem uses virtual processors whereas the 
n = 63 problem fits the physical grid of processors precisely. 
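The roughly 2:1 ratio between the Jacobi and Gauss-Seidel iteration counts can be checked on a small grid. The sketch below (hypothetical function name; pure Python, so it uses n = 16 rather than 64) runs both iterations on the model problem from a zero initial guess and stops when the maximum error has fallen by $10^5$. Because the centred second difference is exact for quadratics, $u = x^2y^2$ is also the exact solution of the discrete 5-point equations, so the error can be driven to round-off.

```python
def iterations_to_converge(n, method, reduction=1e-5, max_iter=100000):
    """Count Jacobi or red-black Gauss-Seidel iterations on the 5-point model problem
    until the max error drops by `reduction` (h = 1/n, exact solution x^2 y^2)."""
    h = 1.0 / n
    exact = [[(i * h) ** 2 * (j * h) ** 2 for j in range(n + 1)] for i in range(n + 1)]
    f = [[-2.0 * ((i * h) ** 2 + (j * h) ** 2) for j in range(n + 1)] for i in range(n + 1)]
    u = [[exact[i][j] if i in (0, n) or j in (0, n) else 0.0 for j in range(n + 1)]
         for i in range(n + 1)]
    interior = [(i, j) for i in range(1, n) for j in range(1, n)]

    def max_error():
        return max(abs(u[i][j] - exact[i][j]) for i, j in interior)

    def update(src, i, j):
        # Solve (4 u_ij - S_ij) / h^2 = f_ij for u_ij, using neighbour values from src.
        return 0.25 * (src[i - 1][j] + src[i + 1][j] + src[i][j - 1] + src[i][j + 1]
                       + h * h * f[i][j])

    e0 = max_error()
    for it in range(1, max_iter + 1):
        if method == "jacobi":
            old = [row[:] for row in u]
            for i, j in interior:
                u[i][j] = update(old, i, j)
        elif method == "red-black":
            for color in (0, 1):          # red points (i + j even), then black points
                for i, j in interior:
                    if (i + j) % 2 == color:
                        u[i][j] = update(u, i, j)
        else:
            raise ValueError(method)
        if max_error() <= reduction * e0:
            return it
    raise RuntimeError("no convergence within max_iter")


jacobi = iterations_to_converge(16, "jacobi")
red_black = iterations_to_converge(16, "red-black")
print(jacobi, red_black, jacobi / red_black)
```

On this grid the ratio of the two counts should come out close to 2, consistent with the fact that red-black Gauss-Seidel has spectral radius $\rho_J^2$ for this problem, where $\rho_J$ is the Jacobi spectral radius.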

The SOR method greatly improved the convergence of the problem. It needed only 113 iterations to 
get to the same level of error as the previous two iterative methods (for the n = 64 problem). The
real, user, and system times have also been significantly reduced. This agrees with the theoretical
results for the behaviour of these three iterative methods on the model problem. 
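For the model problem the classical (Young) theory gives closed forms: with $h = 1/n$, the Jacobi spectral radius is $\rho_J = \cos(\pi h)$, red-black Gauss-Seidel has $\rho_{GS} = \rho_J^2$, and SOR with the optimal parameter $\omega_{opt} = 2/(1 + \sin \pi h)$ has $\rho_{SOR} = \omega_{opt} - 1$. The sketch below (hypothetical function name) turns these into asymptotic iteration estimates for a $10^5$ error reduction; these are estimates only and need not match the measured counts, which also depend on the initial error.

```python
import math


def asymptotic_estimates(n, reduction=1e5):
    """Model-problem spectral radii (h = 1/n) and the iteration counts they imply
    for reducing the error by `reduction`."""
    h = 1.0 / n
    rho_jacobi = math.cos(math.pi * h)
    rho_gs = rho_jacobi ** 2                      # red-black Gauss-Seidel
    omega_opt = 2.0 / (1.0 + math.sin(math.pi * h))
    rho_sor = omega_opt - 1.0                     # SOR with the optimal omega

    def iterations(rho):
        # Smallest k with rho**k <= 1 / reduction.
        return math.ceil(math.log(reduction) / -math.log(rho))

    return {
        "omega_opt": omega_opt,
        "jacobi": iterations(rho_jacobi),
        "gauss_seidel": iterations(rho_gs),
        "sor": iterations(rho_sor),
    }


print(asymptotic_estimates(64))
```

For n = 64 this gives $\omega_{opt} \approx 1.906$ and estimates of roughly 9550 (Jacobi), 4780 (Gauss-Seidel) and 118 (SOR) iterations. The 2:1 Jacobi-to-Gauss-Seidel ratio and the SOR estimate are in line with Table 1, while the measured Jacobi and Gauss-Seidel counts there are smaller than these asymptotic estimates.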

The preconditioned conjugate gradient iteration was implemented using a red-black coloring scheme 
and Symmetric SOR as the preconditioner. The method is competitive with the SOR iteration for 
the n = 63 problem. It is, however, slower than SOR for the slightly larger problem. 

To make a fair comparison, one must compare the multigrid algorithm with the n = 64 size 
problems of the other four iterative methods, since multigrid was implemented using a finest grid of 
this size. As one can see from the table, multigrid converges significantly faster than Jacobi, 
Gauss-Seidel and preconditioned conjugate gradient, and slightly faster than SOR (in real time). It 
even beats the other four methods when they are run on the smaller problem. This shows that 
multigrid is a very competitive method even with its limitations as discussed previously. Only three 
V-cycles are needed to reduce the error to the desired level. Five levels were used (n = 4 at the
coarsest level) with $\nu_1 = \nu_2 = 3$ pre- and post-smoothing sweeps.
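A one-dimensional analogue of the V-cycle can be sketched as follows. Under our assumptions it uses red-black Gauss-Seidel smoothing with $\nu_1 = \nu_2 = 3$, full-weighting restriction, linear interpolation, and five levels from n = 64 down to n = 4, where many sweeps act as an exact solve. The paper does not state its transfer operators for this test, so those choices (and all function names) are illustrative.

```python
import math


def smooth(u, f, h, sweeps):
    """Red-black Gauss-Seidel for -u'' = f: odd-indexed points, then even-indexed points."""
    n = len(u) - 1
    for _ in range(sweeps):
        for start in (1, 2):
            for i in range(start, n, 2):
                u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])


def residual(u, f, h):
    n = len(u) - 1
    r = [0.0] * (n + 1)
    for i in range(1, n):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r


def restrict(r):
    """Full weighting onto the grid with half as many intervals."""
    nc = (len(r) - 1) // 2
    rc = [0.0] * (nc + 1)
    for i in range(1, nc):
        rc[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1]
    return rc


def prolong(ec):
    """Linear interpolation of a coarse-grid correction to the next finer grid."""
    e = [0.0] * (2 * len(ec) - 1)
    for i, value in enumerate(ec):
        e[2 * i] = value
    for i in range(1, len(e) - 1, 2):
        e[i] = 0.5 * (e[i - 1] + e[i + 1])
    return e


def v_cycle(u, f, h, nu1=3, nu2=3, n_coarsest=4):
    n = len(u) - 1
    if n <= n_coarsest:
        smooth(u, f, h, 50)               # effectively exact on the 4-interval grid
        return u
    smooth(u, f, h, nu1)                  # pre-smoothing
    rc = restrict(residual(u, f, h))
    ec = v_cycle([0.0] * len(rc), rc, 2.0 * h, nu1, nu2, n_coarsest)
    for i, e in enumerate(prolong(ec)):   # coarse-grid correction
        u[i] += e
    smooth(u, f, h, nu2)                  # post-smoothing
    return u


n = 64
h = 1.0 / n
f = [math.pi ** 2 * math.sin(math.pi * i * h) for i in range(n + 1)]
u = [0.0] * (n + 1)
history = [max(abs(r) for r in residual(u, f, h))]
for cycle in range(3):
    v_cycle(u, f, h)
    history.append(max(abs(r) for r in residual(u, f, h)))
print(history)
```

Printing `history` shows the maximum residual before and after each of the three V-cycles.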

Scalar Liquid Crystal Problem 

The scalar analog to the full liquid crystal problem is of interest because it has a similar structure 
to the full model. Various algorithms for solving the full model are first developed for the scalar 
problem. The relative performances of these algorithms were basically the same for both models. 
The free-energy density for the scalar-field analogue to the full systems model is given by:

$$ f(q) = \frac{1}{2}L_1|\nabla q|^2 + \frac{1}{2}Aq^2 - \frac{1}{3}Bq^3 + \frac{1}{4}Cq^4 - H^2 q, \tag{10} $$

where $L_1$ is an elastic constant, $A, B, C$ are bulk constants and $H$ is a field term representing an
outside field such as a magnetic field. To minimize the free energy of the system one needs to solve
the Euler-Lagrange equation

$$ -L_1\nabla^2 q + Aq - Bq^2 + Cq^3 = H^2. \tag{11} $$

Equation (11) is nonlinear in the scalar variable q. The linear system that needs to be solved at
each Newton step is very similar in structure to the Laplacian problem. The only difference lies in
the additional terms on the diagonal of the system matrix, which result from the nonlinearity of the
scalar problem. The discretization of the scalar Euler-Lagrange equation
produces a 5-point stencil at each mesh point. The communications pattern is thus the same as it 
was for the Laplacian problem. A red-black coloring scheme is sufficient to induce parallelism into 
the iterative solvers used. 

For the problem used in these tests $L_1 = 1.0$, $A = B = C = 1.0$, $H = 0.0$, with Dirichlet boundary
conditions given by:

$$ q = 1 \text{ on } x = 1 \text{ and } y = 1, \qquad q = x \text{ on } y = 0, \qquad q = y \text{ on } x = 0. $$

The true solution to this problem is not known, therefore the error cannot be computed. The 
maximum absolute residual at each iteration is used to monitor the convergence. The initial guess is 
given by q=0.0 at each interior mesh point and iteration proceeds until the maximum absolute 
residual is reduced by approximately a factor of $10^6$. The initial residuals for the n = 64 and n = 63
size problems are 8192 and 7938, respectively. Table 2 gives the results of the simulations. 
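The initial residuals 8192 and 7938 follow from the boundary data: with $q = 0$ in the interior the nonlinear terms vanish and the residual of (11) reduces to $L_1$ times the discrete Laplacian of the boundary values. That is largest next to the corner (1, 1), where two neighbours equal 1, giving $2/h^2$. The sketch assumes $h = 1/n$ and the standard 5-point Laplacian; the function name is hypothetical.

```python
def scalar_lc_max_residual(n, L1=1.0, A=1.0, B=1.0, C=1.0, H=0.0):
    """Max |H^2 - (-L1 lap_h q + A q - B q^2 + C q^3)| over interior points for
    q = 0 in the interior and the Dirichlet data of the test problem; h = 1/n."""
    h = 1.0 / n

    def q0(i, j):
        if i == n or j == n:
            return 1.0        # q = 1 on x = 1 and on y = 1
        if j == 0:
            return i * h      # q = x on y = 0
        if i == 0:
            return j * h      # q = y on x = 0
        return 0.0            # interior initial guess

    worst = 0.0
    for i in range(1, n):
        for j in range(1, n):
            q = q0(i, j)
            s = q0(i - 1, j) + q0(i + 1, j) + q0(i, j - 1) + q0(i, j + 1)
            lap = (s - 4.0 * q) / (h * h)
            r = H * H - (-L1 * lap + A * q - B * q ** 2 + C * q ** 3)
            worst = max(worst, abs(r))
    return worst


print(scalar_lc_max_residual(64), scalar_lc_max_residual(63))
```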


Table 2. Timings for the Scalar Liquid Crystal Problem on the Wavetracer DTC

                            real   user   syst   max. residual   max. error   outer iter.
SOR (n=64)                   6.2    0.3    1.4       2.7(-3)           -        9
SOR (n=63)                   3.3    0.3    1.4       2.4(-3)           -        9
Pre-cg (n=64)               12.0    0.8    1.2       2.5(-3)           -        8
Pre-cg (n=63)                5.2    0.8    0.9       2.1(-3)           -        8
Multigrid (n=64)             4.0    0.5    0.6       2.3(-3)           -        4
Nested SOR (n=64)            6.8    0.5    1.7       5.9(-3)           -        19 (4,3,4,4,4)
Nested Multigrid (n=64)      4.0    0.4    0.7       5.9(-3)           -        8 (4,1,1,1,1)


Comparing the real execution times of these algorithms shows again that the preconditioned 
conjugate gradient method is not competitive on this kind of architecture. The nested (multilevel) 
methods use five levels with the coarsest level being of size n = 4. The numbers in parentheses in 
the last column of the table are the number of outer Newton iterations needed to achieve 
convergence at each level. 
We refer to a nonlinear Newton-based analogue of the Full Multigrid (FMG) method as nested
(multilevel) multigrid. It employs Newton iteration on each mesh level until the desired accuracy is
attained. In the case of the example considered here, this required 4 outer iterations on the coarsest
mesh and one on each of the finer mesh levels. With the exception of the coarsest level, a V-cycle with
$\nu_1 = \nu_2 = 3$ is applied at each level to solve the linear system arising from the Newton process. It is
considerably faster than nested SOR in achieving the same reduction in the residual. Thus, with the
exception of the initial 4 Newton cycles, it is a natural nonlinear analogue of the Full Multigrid
method (FMG) [16, p. 22]. Multigrid (non-nested) seems to perform best, since its timings are
essentially the same as those of nested multigrid but it achieves a greater residual reduction. The
SOR (n = 63) iteration has the fastest real time, but on a smaller problem where no virtual
processing is involved.

Note that the optimal $\omega$ from the Laplacian model was used in the SOR iterations for the scalar
problem. The experimental results showed that this was a good choice, giving better convergence than
any other value tried. The stopping criterion used to terminate the inner iterations at the $k$-th
Newton step was $\eta_k = 1/(k + 2)$.
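A one-dimensional analogue shows how the forcing term $\eta_k = 1/(k+2)$ controls the inner solves. Linearizing $-L_1 q'' + Aq - Bq^2 + Cq^3 = H^2$ about the current iterate adds $A - 2Bq + 3Cq^2$ to the diagonal of the discrete second-difference operator. In this sketch the inner solver is plain Gauss-Seidel, stopped once the linear residual falls below $\eta_k$ times the nonlinear residual; the boundary values q(0) = 0 and q(1) = 1, the grid size, the use of Gauss-Seidel instead of SOR or multigrid, and all function names are illustrative assumptions.

```python
def nonlinear_residual(q, h, L1, A, B, C, H):
    """F(q) = H^2 - (-L1 q'' + A q - B q^2 + C q^3) at interior points (zero at the ends)."""
    n = len(q) - 1
    r = [0.0] * (n + 1)
    for i in range(1, n):
        lap = (q[i - 1] - 2.0 * q[i] + q[i + 1]) / (h * h)
        r[i] = H * H - (-L1 * lap + A * q[i] - B * q[i] ** 2 + C * q[i] ** 3)
    return r


def inexact_newton(n=16, L1=1.0, A=1.0, B=1.0, C=1.0, H=0.0,
                   reduction=1e-6, max_outer=50, max_sweeps=100000):
    h = 1.0 / n
    q = [0.0] * (n + 1)
    q[n] = 1.0                                   # assumed data: q(0) = 0, q(1) = 1
    r = nonlinear_residual(q, h, L1, A, B, C, H)
    r0 = max(abs(v) for v in r)
    for k in range(max_outer):
        rmax = max(abs(v) for v in r)
        if rmax <= reduction * r0:
            return q, k
        eta = 1.0 / (k + 2)                      # forcing term eta_k = 1/(k+2)
        off = -L1 / (h * h)                      # off-diagonal entries of the Jacobian
        diag = [2.0 * L1 / (h * h) + A - 2.0 * B * v + 3.0 * C * v * v for v in q]
        d = [0.0] * (n + 1)                      # Newton correction, zero at the ends
        for _ in range(max_sweeps):
            for i in range(1, n):                # one Gauss-Seidel sweep on J d = r
                d[i] = (r[i] - off * (d[i - 1] + d[i + 1])) / diag[i]
            lin = max(abs(r[i] - diag[i] * d[i] - off * (d[i - 1] + d[i + 1]))
                      for i in range(1, n))
            if lin <= eta * rmax:                # inner stopping test
                break
        q = [qi + di for qi, di in zip(q, d)]
        r = nonlinear_residual(q, h, L1, A, B, C, H)
    raise RuntimeError("Newton iteration did not converge")


q, outer = inexact_newton()
print(outer)
```

Since each outer step reduces the residual by roughly $\eta_k$, a $10^6$ reduction needs on the order of ten outer iterations, because $\prod_k 1/(k+2)$ falls below $10^{-6}$ after about nine factors.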


Full Liquid Crystal Problem 


Table 3 gives the results of the numerical simulations for the full systems model. The same test 
problem was used as in the case of the reduced model together with the appropriate Dirichlet 
boundary conditions. Only the size n = 64 problem was considered for this set of runs. The 
following set of parameter values was used: $L_1 = 10.0$, $L_2 = L_3 = 1.0$, $A = B = C = D = M = M' = 1.0$,
and the outside field parameters are set to zero. Results from both the 7-point and 9-point
discretizations are given. Both point and block iterative methods were compared. The initial
maximum absolute residuals for the 7-point and 9-point schemes are 8.2(4) and 8.95(4), respectively.
The iterations were continued until the maximum residual was reduced by approximately a factor of
$10^6$. The initial maximum error is 1.0 since initial guesses of 0.0 were used for all five unknowns
(i = 1, ..., 5) at the interior mesh points. The simulations were all done in single precision.

The SOR methods used 10 inner iterations for each Newton outer iteration. The stopping criterion
used for the reduced model ($\eta_k = 1/(k + 2)$) was too restrictive in some cases and caused
convergence problems. Using 10 inner iterations avoided these problem areas. As before, the
multigrid methods outperformed their SOR counterparts. The 7-point iterative scheme (point 
method) was competitive with the 9-point scheme for both multigrid and nested (multilevel) 
multigrid. This was not the case for the SOR methods. The 9-point scheme performed better for 
the one-level SOR case but did worse for the multilevel iteration. Block methods were not 
competitive for either multigrid or nested multigrid. The block method performed best for the 
single-level SOR iterations, and was also competitive in the nested case. The best algorithm for 
solving the test problem was again nested (multilevel) multigrid using the point iterative approach. 
The 9-point scheme performed marginally better than the 7-point, producing a slightly smaller 
residual, upon convergence, in about the same amount of real time. The preconditioned conjugate
gradient methods were not implemented for the full model since they had proved uncompetitive in the
reduced model case [7].
Table 3. Timings for the Full Systems Liquid Crystal Problem on the Wavetracer DTC

                               real   user   syst   max. residual   max. error   outer iter.
SOR (7p)                      277.0    8.2   15.6      8.17(-2)           ?        42 (10 inner/out)
SOR (9p)                      148.7    7.6   12.9          ?              ?        34 (10 inner/out)
Block-SOR (9p)                106.3    6.4    6.5      7.95(-2)       1.27(-5)     17 (10 inner/out)
Multigrid (7p)                 47.2    2.0    1.6      6.55(-2)       5.36(-5)     4 (1 V-cyc/out)
Multigrid (9p)                 42.2    2.2    1.9      3.40(-2)       4.72(-5)     4 (1 V-cyc/out)
Block-Multigrid (9p)           60.9    4.1    1.6      2.99(-2)       4.15(-5)     4 (1 V-cyc/out)
Nested SOR (7p)                85.9    3.6    7.5      8.26(-2)       1.36(-5)     18 (3,1,2,4,8)
Nested SOR (9p)               100.5    6.2    9.7      6.14(-2)       1.98(-5)     25 (3,1,2,5,14)
Block-Nested SOR (9p)         105.0    7.4    7.6      8.23(-2)       1.05(-5)     18 (3,1,2,4,8)
Nested Multigrid (7p)          37.4    2.4    2.8      8.44(-2)       6.88(-6)     8 (4,1,1,1,1)
Nested Multigrid (9p)          36.3    2.8    2.8      4.62(-2)       3.86(-6)     8 (4,1,1,1,1)
Block-Nested Multigrid (9p)    50.5    3.8    2.7      4.62(-2)       3.89(-6)     8 (4,1,1,1,1)


CONCLUDING REMARKS 


Multigrid methods work well as inner solvers for liquid crystal problems when implemented on 
SIMD computers with 2-D grid architectures. Multi-colored SOR methods are also effective, but 
due to the cost of inner products on such machines pre-conditioned conjugate gradient methods are 
not. The multigrid algorithms (one-level and multilevel) perform better than their SOR 
counterparts for the larger n = 64 problem. 

Although the Wavetracer’s mesh architecture fits the problem (discretization) well, thereby making
communications between nearest neighbors efficient, it is not as well suited for multigrid algorithms.
This is due to the fact that the machine has a physical 2-D grid structure with 64 processors in each
dimension. For multigrid and multilevel iterative schemes a grid size of 65 x 65 is required for an
efficient implementation, because of the way the grid refinements are defined. So for an n = 64 size
problem, the machine must go into virtual processing mode, thus slowing down the execution
time of the algorithm and increasing the storage overhead. One solution would be to generate grids 
that would not suffer this problem, but this involves considerably more complex coding, which 
would also increase execution time and storage overhead but not to the same extent as virtual 
processing. We emphasize that the multigrid implementation employed here is effectively the 
sequential version of the multigrid method. Thus on the coarsest mesh only 0.4% of the processors 
were active. Despite this disadvantage multigrid proved the fastest of the algorithms tested. We 
remark that although these methods worked well for the test problems, where the iteration matrix
was symmetric positive definite, convergence problems can be expected when the system becomes
indefinite, owing to the coarseness of the coarsest mesh. Use of a coarsest mesh with more points can
be expected to remove this problem as well as to improve the performance due to higher processor
utilization. Further improvements in performance could be expected if parallelism were introduced
using the methods of [4, 5, 8, 9, 15].
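The 0.4% figure is consistent with a simple count, assuming that each level with n intervals keeps an n x n block of the 64 x 64 processor array busy; that mapping of coarse points to processors, and the function names, are our assumptions. The second helper restates why n = 64 forces virtual processing: a grid with n intervals has n + 1 points per side.

```python
def active_fractions(n_fine=64, levels=5, array_side=64):
    """Fraction of an array_side x array_side processor array that is busy on each level,
    assuming level l keeps an (n_fine / 2**l) x (n_fine / 2**l) block of processors active."""
    return [((n_fine >> level) / array_side) ** 2 for level in range(levels)]


def needs_virtual_processors(n, array_side=64):
    """A grid with n intervals has n + 1 points per side; more than array_side forces virtual mode."""
    return n + 1 > array_side


print(active_fractions())   # coarsest entry is 16/4096, i.e. about 0.4%
print(needs_virtual_processors(64), needs_virtual_processors(63))
```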
SUMMARY


This paper describes a simple numerical model for hurricane track prediction which uses a 
multigrid method to adapt the model resolution as the vortex moves. The model is based on 
the modified barotropic vorticity equation, discretized in space by conservative finite differences 
and in time by a Runge-Kutta scheme. A multigrid method is used to solve an elliptic problem 
for the streamfunction at each time step. Nonuniform resolution is obtained by superimposing 
uniform grids of different spatial extent; these grids move with the vortex as it moves. Preliminary 
numerical results indicate that the local mesh refinement allows accurate prediction of the 
hurricane track with substantially less computer time than required on a single uniform grid. 


INTRODUCTION 


Accurately predicting the track of a moving hurricane is a problem of great practical 
importance. One approach is to treat the problem as one in computational fluid dynamics, taking 
observed meteorological data as initial values for a numerical model. Many factors influence 
the accuracy of this approach, including the initial data (or lack thereof), the dynamical and 
physical processes included in the model, and the numerical scheme employed. While the relative 
importance of these three factors is a subject of considerable debate, in this paper we focus on the 
third. 

Our premise is that predicting the track of a moving hurricane accurately requires resolving 
the flow field adequately on both the large scale surrounding the vortex and the small scale within 
the vortex itself. Since the spatial scales involved may differ by more than an order of magnitude, 
models using uniform resolution are inherently less efficient than what should be possible. Here, 
we use a simple dynamical model which has been used successfully by many authors (ref. 1, 2, 3), 
namely, the modified barotropic vorticity equation. However, rather than use a single uniform 
grid as in those studies, we investigate the use of adaptive multigrid techniques, with the goal 
of obtaining high accuracy at low computational cost. In the following sections we detail the 
formulation of the model, describe the mesh refinement scheme, and present some preliminary 
numerical results. 


* Work supported by the National Science Foundation under Grant No. ATM-9118966 
MODEL FORMULATION


Governing Equations 


We formulate the model on a section of the sphere using a Mercator projection (true at latitude
$\varphi = \varphi_c$). The model consists of the modified barotropic vorticity equation

$$ \frac{\partial \zeta}{\partial t} + m^2 J(\psi, \zeta) + \beta m \frac{\partial \psi}{\partial x} = \nu m^2 \nabla^2 \zeta, \tag{1} $$

where the relative vorticity $\zeta$ and streamfunction $\psi$ are related by

$$ \left(m^2 \nabla^2 - \gamma^2\right)\psi = \zeta. \tag{2} $$

Here $\nabla^2 = \partial^2/\partial x^2 + \partial^2/\partial y^2$, $J(\psi, \zeta)$ is the Jacobian of $(\psi, \zeta)$ with respect to $(x, y)$,
$\beta = 2\Omega\cos\varphi / a$ (with $a$ and $\Omega$ the radius and rotation rate of the earth), and $m = \cos\varphi_c / \cos\varphi$
is the map factor. There are two quasi-physical parameters: the diffusion coefficient $\nu$, and the
parameter $\gamma$ (inverse of the effective Rossby radius), which helps prevent retrogression of ultralong
Rossby waves (ref. 4). We also consider versions on the f-plane ($m = 1$ and $\beta = 0$) and β-plane ($m = 1$
and $\beta = 2\Omega\cos\varphi_c / a$). The model domain is a rectangle in $x$ and $y$ centered at $(x, y) = (0, 0)$,
where $(\lambda, \varphi) = (\lambda_c, \varphi_c)$. At the boundaries we specify the streamfunction $\psi$ (and thus the normal
component of the velocity); where there is inflow, we also specify the vorticity $\zeta$.


Space Discretization 


On a single uniform rectangular grid $\Omega^h$ consisting of gridpoints $(x_i, y_j)$ with mesh spacing $h$ in
$x$ and $y$, we discretize (1) and (2) in space by finite differences as

$$ \frac{d\zeta_{i,j}}{dt} + m_j^2 J_{i,j}(\psi, \zeta) + \beta_j m_j \delta_{2h}\psi_{i,j} = \nu m_j^2 \nabla_h^2 \zeta_{i,j} \tag{3} $$

and

$$ \left(m_j^2 \nabla_h^2 - \gamma^2\right)\psi_{i,j} = \zeta_{i,j}, \tag{4} $$

respectively. Here $J_{i,j}(\psi, \zeta)$ is the discrete Jacobian of Arakawa (ref. 5), and $\delta_{2h}\psi_{i,j}$ and
$\nabla_h^2\psi_{i,j}$ are the $O(h^2)$ centered difference approximations to $\partial\psi/\partial x$ and the Laplacian operator,
respectively. We apply (3) and (4) at the interior points. At boundary points where there is inflow,
$\zeta$ is specified; otherwise, we predict $\zeta$ on the boundaries by applying an equation of the form (3)
but using appropriate one-sided differences. It should be noted that using the Arakawa Jacobian
is crucial here: the fact that it conserves discrete analogues of vorticity, enstrophy (mean square
vorticity), and kinetic energy implies that the model is not subject to nonlinear computational
instability.
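The conservation property can be checked numerically. The sketch implements Arakawa's Jacobian as the average of his three second-order forms, on a doubly periodic grid (an assumption made so that boundary terms vanish; the hurricane model itself uses open boundaries), and evaluates the discrete sums of $J$, $\zeta J$ and $\psi J$, which should vanish to round-off. Function names and the random test fields are hypothetical.

```python
import random


def arakawa_jacobian(psi, zeta, h):
    """Arakawa's Jacobian J(psi, zeta) = psi_x zeta_y - psi_y zeta_x on a doubly periodic
    n x n grid: the average of his three second-order forms J++, J+x and Jx+."""
    n = len(psi)
    p, z = psi, zeta
    J = [[0.0] * n for _ in range(n)]
    for i in range(n):
        ip, im = (i + 1) % n, (i - 1) % n        # neighbours in x (periodic)
        for j in range(n):
            jp, jm = (j + 1) % n, (j - 1) % n    # neighbours in y (periodic)
            j_pp = ((p[ip][j] - p[im][j]) * (z[i][jp] - z[i][jm])
                    - (p[i][jp] - p[i][jm]) * (z[ip][j] - z[im][j]))
            j_px = (p[ip][j] * (z[ip][jp] - z[ip][jm]) - p[im][j] * (z[im][jp] - z[im][jm])
                    - p[i][jp] * (z[ip][jp] - z[im][jp]) + p[i][jm] * (z[ip][jm] - z[im][jm]))
            j_xp = (z[i][jp] * (p[ip][jp] - p[im][jp]) - z[i][jm] * (p[ip][jm] - p[im][jm])
                    - z[ip][j] * (p[ip][jp] - p[ip][jm]) + z[im][j] * (p[im][jp] - p[im][jm]))
            J[i][j] = (j_pp + j_px + j_xp) / (12.0 * h * h)
    return J


def grid_sum(a, b=None):
    """Sum of a, or of the pointwise product a * b, over the whole grid."""
    n = len(a)
    return sum(a[i][j] * (1.0 if b is None else b[i][j]) for i in range(n) for j in range(n))


rng = random.Random(0)
n = 16
psi = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
zeta = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
J = arakawa_jacobian(psi, zeta, h=1.0)
print(grid_sum(J), grid_sum(zeta, J), grid_sum(psi, J))   # vorticity, enstrophy, energy
```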




To write the space-discretized equations in a more compact form, we collect the values $\psi_{i,j}$ and
$\zeta_{i,j}$ into grid functions $\psi^h$ and $\zeta^h$, respectively, defined on the grid $\Omega^h$. We can then write (3) and
(4) as

$$ \frac{d\zeta^h}{dt} = F^h(\psi^h, \zeta^h) \tag{5} $$

and

$$ G^h(\psi^h) = \zeta^h, \tag{6} $$

where the operators $F^h$ and $G^h$ express the space discretization described above.


Time Discretization 


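To discretize (5) and (6) in time we use the classical fourth-order Runge-Kutta (RK4) scheme.
To describe it, we specify a time step $\Delta t > 0$ and introduce time levels $t_k = k\Delta t$ for $k = 0, 1, \ldots$.
Suppressing the superscript $h$ for simplicity, we now use the superscript $k$ to denote values at time
level $k$, e.g., $\psi^k \approx \psi^h(t_k)$. With this notation, the RK4 scheme can be written as

$$
\begin{aligned}
\frac{\zeta^{k+1/2} - \zeta^k}{\Delta t/2} &= F^k := F(\psi^k, \zeta^k), &
G(\psi^{k+1/2}) &= \zeta^{k+1/2}, \\
\frac{\tilde\zeta^{k+1/2} - \zeta^k}{\Delta t/2} &= F^{k+1/2} := F(\psi^{k+1/2}, \zeta^{k+1/2}), &
G(\tilde\psi^{k+1/2}) &= \tilde\zeta^{k+1/2}, \\
\frac{\tilde\zeta^{k+1} - \zeta^k}{\Delta t} &= \tilde F^{k+1/2} := F(\tilde\psi^{k+1/2}, \tilde\zeta^{k+1/2}), &
G(\tilde\psi^{k+1}) &= \tilde\zeta^{k+1}, \\
\frac{\zeta^{k+1} - \zeta^k}{\Delta t} &= \bar F^{k+1}, &
G(\psi^{k+1}) &= \zeta^{k+1},
\end{aligned}
\tag{7}
$$

where

$$
\bar F^{k+1} = \frac{1}{6}\left(F^k + 2F^{k+1/2} + 2\tilde F^{k+1/2} + \tilde F^{k+1}\right),
\qquad \tilde F^{k+1} := F(\tilde\psi^{k+1}, \tilde\zeta^{k+1}). \tag{8}
$$

Thus, to execute a single time step $t_k \to t_{k+1}$, we perform the four stages indicated in (7); each of
these stages consists of computing $F$ based on known values of $\psi$ and $\zeta$, predicting a new vorticity
$\zeta$, and solving the diagnostic equation for the corresponding streamfunction $\psi$.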
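The stage structure of (7)-(8) can be sketched generically. In the sketch below, `F` and `solve_G` are hypothetical callables standing for the discrete operator $F^h$ and for a solver of $G^h(\psi) = \zeta$ (the multigrid method in the model); each of the four stages ends with one such diagnostic solve. The toy operators in the usage lines are assumptions chosen only because the exact answer, $e^{-t/2}$, is known.

```python
import math


def rk4_step(zeta, psi, dt, F, solve_G):
    """Advance d(zeta)/dt = F(psi, zeta), G(psi) = zeta by one RK4 step.

    zeta, psi: lists of floats at time level k; F(psi, zeta) returns a list;
    solve_G(zeta) returns psi with G(psi) = zeta. Each stage ends with one diagnostic solve."""
    def combo(base, a, incr):
        return [b + a * x for b, x in zip(base, incr)]

    F0 = F(psi, zeta)
    z1 = combo(zeta, 0.5 * dt, F0)
    p1 = solve_G(z1)
    F1 = F(p1, z1)
    z2 = combo(zeta, 0.5 * dt, F1)
    p2 = solve_G(z2)
    F2 = F(p2, z2)
    z3 = combo(zeta, dt, F2)
    p3 = solve_G(z3)
    F3 = F(p3, z3)
    F_bar = [(a + 2.0 * b + 2.0 * c + d) / 6.0 for a, b, c, d in zip(F0, F1, F2, F3)]
    z_new = combo(zeta, dt, F_bar)
    return z_new, solve_G(z_new)


# Toy check with hypothetical operators: G(psi) = 2 psi and F = -psi give d(zeta)/dt = -zeta/2.
zeta, psi = [1.0], [0.5]
for _ in range(10):
    zeta, psi = rk4_step(zeta, psi, 0.1,
                         lambda p, z: [-v for v in p],
                         lambda z: [v / 2.0 for v in z])
print(zeta[0], math.exp(-0.5))
```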

Although it requires four times as much work (per time step) as the second-order Adams- 
Bashforth scheme commonly used in such models, this RK4 scheme has several advantages. First, 
it allows time steps at least four times as large, so in fact it is more efficient. Second, it is more 
accurate, so time discretization errors are less likely to distort the conservation properties of the 
Arakawa Jacobian. Finally, since it is a one-step scheme, it has no computational modes and needs 
no other method for the initial time step. 
Multigrid Solution


To solve the diagnostic equation at each stage for the streamfunction $\psi$, we use a multigrid
method. For the relaxation scheme we use a point Gauss-Seidel method formulated as follows. The
discrete (interior) equation (4) can be written as

$$ \frac{m_j^2}{h^2}\left(\sigma_j \psi_{i,j} - S_{i,j}\right) = -\zeta_{i,j} =: F_{i,j}, \tag{9} $$

where

$$ S_{i,j} := \psi_{i-1,j} + \psi_{i+1,j} + \psi_{i,j-1} + \psi_{i,j+1} \tag{10} $$

is the sum of the neighboring values of $\psi$ and

$$ \sigma_j := 4 + \frac{\gamma^2 h^2}{m_j^2} \tag{11} $$

is the diagonal term of the discrete Helmholtz operator. Given an approximate solution $\psi$ of (9),
we relax at a point $(i, j)$ by changing the value there to satisfy the corresponding equation (9); this
results in the new value

$$ \bar\psi_{i,j} = \frac{h^2 F_{i,j}/m_j^2 + S_{i,j}}{\sigma_j}, \tag{12} $$

where $S_{i,j}$ is defined using the current surrounding values in (10). The corresponding residual (if
needed) is given by

$$ r_{i,j} := F_{i,j} - \frac{m_j^2}{h^2}\left(\sigma_j \psi_{i,j} - S_{i,j}\right) = \frac{m_j^2 \sigma_j}{h^2}\left(\bar\psi_{i,j} - \psi_{i,j}\right). \tag{13} $$

We use this relaxation (with red-black ordering) as a smoother in a multigrid method, using
half-injection for the fine-to-coarse transfer of residuals and bilinear interpolation for the
coarse-to-fine transfer of corrections. For the control algorithm we use repeated V(1,1)-cycles.
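Equations (10)-(13) translate almost line by line into a relaxation routine. The sketch below is a minimal pure-Python version with red-black ordering, assuming arrays indexed [i][j] with j the row (so that the map factor depends on j only); the routine names and the small test problem are hypothetical, and only the smoother is shown, not the full multigrid solver.

```python
import math


def relax_red_black(psi, F, m, h, gamma):
    """One red-black Gauss-Seidel sweep for eq. (9), (m_j^2/h^2)(sigma_j psi_ij - S_ij) = F_ij.

    psi, F: (nx+1) x (ny+1) lists indexed [i][j]; boundary values of psi are left unchanged.
    m: list of map factors m_j, one per row j."""
    nx, ny = len(psi) - 1, len(psi[0]) - 1
    for color in (0, 1):
        for i in range(1, nx):
            for j in range(1, ny):
                if (i + j) % 2 != color:
                    continue
                S = psi[i - 1][j] + psi[i + 1][j] + psi[i][j - 1] + psi[i][j + 1]   # (10)
                sigma = 4.0 + (gamma * h / m[j]) ** 2                                # (11)
                psi[i][j] = (h * h * F[i][j] / m[j] ** 2 + S) / sigma               # (12)


def residual(psi, F, m, h, gamma):
    """r_ij = F_ij - (m_j^2/h^2)(sigma_j psi_ij - S_ij), eq. (13); zero on the boundary."""
    nx, ny = len(psi) - 1, len(psi[0]) - 1
    r = [[0.0] * (ny + 1) for _ in range(nx + 1)]
    for i in range(1, nx):
        for j in range(1, ny):
            S = psi[i - 1][j] + psi[i + 1][j] + psi[i][j - 1] + psi[i][j + 1]
            sigma = 4.0 + (gamma * h / m[j]) ** 2
            r[i][j] = F[i][j] - m[j] ** 2 / h ** 2 * (sigma * psi[i][j] - S)
    return r


# Hypothetical test problem: manufacture F from a known psi, then relax from a zero guess.
nx = ny = 12
h, gamma = 0.1, 0.5
m = [1.0 + 0.02 * j for j in range(ny + 1)]
exact = [[math.sin(0.3 * i) * math.cos(0.2 * j) + 0.1 * i for j in range(ny + 1)]
         for i in range(nx + 1)]
zero = [[0.0] * (ny + 1) for _ in range(nx + 1)]
F = [[-v for v in row] for row in residual(exact, zero, m, h, gamma)]  # operator applied to exact
psi = [[exact[i][j] if i in (0, nx) or j in (0, ny) else 0.0 for j in range(ny + 1)]
       for i in range(nx + 1)]
for _ in range(500):
    relax_red_black(psi, F, m, h, gamma)
err = max(abs(psi[i][j] - exact[i][j]) for i in range(nx + 1) for j in range(ny + 1))
res = max(abs(v) for row in residual(psi, F, m, h, gamma) for v in row)
print(err, res)
```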


LOCAL MESH REFINEMENT 


Given the premise that the flow near the center of the vortex requires much higher resolution 
than the flow surrounding the vortex, we now consider how to provide such variable resolution. 

Our basic method is essentially that of (ref. 6), constructing nonuniform resolution by 
superimposing uniform grids of varying spatial extent. Since all calculations are carried out on the 
uniform grids, programming remains relatively easy. 

To illustrate the method, let us consider first the case of two grids: a coarse grid $\Omega^{2h}$ covering
the whole domain $\Omega$, and a fine grid $\Omega^h$ which covers only a portion of the domain (i.e., enclosing
the vortex). We assume that the boundaries of the fine grid coincide with coarse grid lines. The
model variables $\zeta$ and $\psi$ are carried on both the coarse and fine grids (denoted by $\zeta^{2h}$, $\psi^{2h}$ and $\zeta^h$,
$\psi^h$, respectively). Noting that the coarse grid allows time steps twice as large as those on the fine
grid, we use the following basic procedure for stepping the model from time $t_k$ to $t_{k+1}$:
1. Execute one time step of length $\Delta t$ on the coarse grid to produce $\zeta^{2h,k+1}$, $\psi^{2h,k+1}$;

2. Execute two time steps of length $\Delta t/2$ on the fine grid to produce $\zeta^{h,k+1}$, $\psi^{h,k+1}$,
using boundary values for $\psi$ interpolated from the coarse grid (in space and time);

3. Copy the fine-grid solution to the coarse grid at points common to both. 

Several points deserve mention here. First, in solving the implicit problem for $\psi$ on either grid,
we use the multigrid method outlined above. This introduces additional coarse grids, e.g., a grid
with mesh spacing $2h$ covering only the region of the local fine grid $\Omega^h$. In fact, the “underlying”
part of the coarse grid $\Omega^{2h}$ could be used for this; however, the resulting complications of
preserving interface values (for fine-grid boundary values) and restricting relaxation to only part
of $\Omega^{2h}$ seem too high a price to pay for the relatively small savings in storage which would be
achieved. Second, after completing the above three steps, the resulting solution on the composite
grid $\Omega^{2h} \cup \Omega^h$ could be further refined by applying a composite-grid discretization of the
governing equations; this FAC (Fast Adaptive Composite grid) method and several variants are 
described in (ref. 7), and will be explored in future work. Finally, the above approach generalizes 
immediately to more than two grids. 

For the initial work reported here, we have made the following simplifying assumptions. First, 
we require the grids to be rectangular and strictly nested (i.e., any fine grid is contained wholly
within the interior of the next coarser grid), with one grid per level (i.e., the refinement occurs in 
one region only, surrounding the vortex). Second, we use a constant mesh ratio of two (i.e., the 
mesh spacing h on any grid is twice that of the next finer grid, if any). Finally, we will specify 
the number of grids and their sizes in advance but allow them to move following the vortex as the 
solution is computed. 

Since the problem to be solved has an easily identifiable region of interest surrounding the 
vortex, we take the following simple approach to moving the grids. First, we locate the vortex 
center on the finest grid. Then for each grid in turn, from the next-to-coarsest to the finest, we 
decide whether or not to move the grid. This decision is based on the distance of the vortex 
center from the center of the grid: if it is more than a specified fraction $\alpha$ of the distance $L$ to the
boundary, we move the grid. The move is calculated so as to “overshoot” a bit, i.e., aiming to put
the vortex center beyond the (new) grid center by a specified fraction $\delta$ of the distance to the grid
boundary. Note that care must be taken at this stage to ensure the strict nesting of grids assumed
above. Finally, the grid is moved by shifting the values which remain on the grid and filling in
the rest by interpolation from the next coarser grid. For the results presented here, we check for
possible grid moves after each time step on the coarsest grid, and use the parameters $\alpha = 0.4$ and
$\delta = 0.2$.

To locate the vortex center (needed both for moving the grids as described above and for 
determining the vortex track), we first locate the point of maximum vorticity on the finest grid. We 
then interpolate the vorticity at that point and its nearest neighbors in x and y (five points total) 
by a quadratic function, and define the vortex center to be the location of the maximum of that 
quadratic. 
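The five-point vortex-centre fit can be written out explicitly. Because the quadratic through the maximum point and its four neighbours has no $xy$ term, it separates into two one-dimensional parabolas, and each maximum has a closed form. The sketch assumes uniform spacing and a strict interior maximum; the function name and the test field are hypothetical.

```python
def vortex_center(zeta, x, y):
    """Vortex centre from vorticity zeta[i][j] at points (x[i], y[j]), uniform spacing.

    Takes the interior grid point with maximum vorticity and fits
    c0 + c1*dx + c2*dy + c3*dx**2 + c4*dy**2 through it and its four neighbours;
    returns the location of the maximum of that quadratic."""
    i, j = max(((a, b) for a in range(1, len(x) - 1) for b in range(1, len(y) - 1)),
               key=lambda ab: zeta[ab[0]][ab[1]])
    hx, hy = x[i + 1] - x[i], y[j + 1] - y[j]
    fw, fc, fe = zeta[i - 1][j], zeta[i][j], zeta[i + 1][j]   # west, centre, east
    fs, fn = zeta[i][j - 1], zeta[i][j + 1]                   # south, north
    # No cross term, so the fit splits into two parabolas with vertices at:
    xc = x[i] + 0.5 * hx * (fw - fe) / (fw - 2.0 * fc + fe)
    yc = y[j] + 0.5 * hy * (fs - fn) / (fs - 2.0 * fc + fn)
    return xc, yc


grid = [float(k) for k in range(-3, 4)]
zeta = [[10.0 - (xi - 0.3) ** 2 - 2.0 * (yj + 0.2) ** 2 for yj in grid] for xi in grid]
print(vortex_center(zeta, grid, grid))   # recovers (0.3, -0.2) up to round-off
```

For a field that is itself such a quadratic, as in the usage lines, the fit reproduces the true maximum exactly; for real vorticity fields it gives sub-grid resolution of the track.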
RESULTS


The initial conditions for the test problem consist of an axisymmetric vortex superimposed on
an environmental flow, as considered in (ref. 1). The environmental flow is given by

$$ \psi(y) = \cdots, \tag{14} $$

which corresponds to the zonal current

$$ u(y) = -\frac{d\psi}{dy} = u_0 \sin(\cdots). \tag{15} $$

The tangential wind in the initial vortex is given by

$$ V(r) = 2V_m \, \frac{(r/r_m)\exp\left[-a\,(r/r_m)^b\right]}{1 + (r/r_m)^2}, \tag{16} $$

where $r = [(x - x_0)^2 + (y - y_0)^2]^{1/2}$ is the radial distance from the vortex center $(x_0, y_0)$. Note that
$V$ has the approximate maximum value $V_m$ near $r = r_m$ (exact when $a = 0$); the exponential factor
is included to make $V$ vanish quickly for large $r$. The vorticity corresponding to (16) is

$$ \zeta(r) = \frac{d(rV)}{r\,dr} = \frac{V}{r}\left[\frac{2}{1 + (r/r_m)^2} - ab\,(r/r_m)^b\right]. \tag{17} $$

We will use the following parameter values: $u_0 = 10$ m s$^{-1}$ and $L = 4000$ km for the environmental
flow, and $V_m = 30$ m s$^{-1}$, $r_m = 80$ km, a = 10 -R and $b = 6$ for the initial vortex. The
computational domain is a square of side length 4000 km on a β-plane, using the latitude
$\varphi_c = 20^\circ$ N; the vortex is initially centered at $x_0 = 750$ km and $y_0 = -750$ km. The model was run from
$t = 0$ to $t = 72$ hr; for simplicity we have set $\nu = 0$ and $\gamma = 0$ here.
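As a check on (16) and (17), the sketch below evaluates both formulas and compares (17) with a centred-difference approximation of $d(rV)/(r\,dr)$. The value of $a$ used here is purely illustrative (it is not the paper's value), and the function names are hypothetical; lengths are in metres and speeds in m s$^{-1}$.

```python
import math


def tangential_wind(r, Vm, rm, a, b):
    """Eq. (16): V(r) = 2 Vm (r/rm) exp[-a (r/rm)^b] / (1 + (r/rm)^2)."""
    s = r / rm
    return 2.0 * Vm * s * math.exp(-a * s ** b) / (1.0 + s * s)


def vortex_vorticity(r, Vm, rm, a, b):
    """Eq. (17): zeta(r) = (V/r) [2 / (1 + (r/rm)^2) - a b (r/rm)^b]."""
    s = r / rm
    return tangential_wind(r, Vm, rm, a, b) / r * (2.0 / (1.0 + s * s) - a * b * s ** b)


Vm, rm, b = 30.0, 80e3, 6            # values from the text (m/s, m)
a = 1e-3                             # illustrative only, not the paper's value
r, dr = 150e3, 1.0
numeric = ((r + dr) * tangential_wind(r + dr, Vm, rm, a, b)
           - (r - dr) * tangential_wind(r - dr, Vm, rm, a, b)) / (2.0 * dr) / r
analytic = vortex_vorticity(r, Vm, rm, a, b)
print(tangential_wind(rm, Vm, rm, 0.0, b), analytic, numeric)
```

With $a = 0$ the first printed value is exactly $V_m$, confirming that the maximum sits at $r = r_m$ in that case.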


To establish a standard for comparison, we ran the model with high resolution (384 x 384 grid
with spacing h = 10.42 km and time step 10 min). We then ran the model with a variety of grid
configurations (using up to four grids) and compared the vortex track to that of the reference run.
Table I summarizes these results, with the runs listed in order of increasing execution time (on a
SUN SPARCstation2). All of the cases in this table use only square grids, with $N_x = N_y = N$. The
forecast error is defined as the distance between the predicted vortex location at a given time and 
that in the reference run. These results show that the local refinement process has the potential to 
substantially reduce the execution time required to achieve a given accuracy. For example, a single 
grid with h = 31.25 km (run 6) achieves errors on the order of 10-20 km; with local refinement 
(run 2) comparable accuracy is obtained with only about 36% as much computer time. Similarly, a 
single grid with h = 20.83 km (run 8) achieves errors on the order of 1-5 km; with local refinement 
(run 7) comparable accuracy is obtained with only about 42% as much computer time. In fact, 
run 7 with local refinement achieved about the same accuracy as did the single-grid run with
h = 15.625 km (run 9) but with only about 18% of the computer time. In addition, the solution fields
produced with local refinement (run 7) are smooth, as shown in Figures 1-5, with no indication of 
any problem due to the change of resolution at the grid interfaces. 
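The percentages quoted above are simple ratios of the execution times in Table I; the snippet below recomputes them from the tabulated values (the dictionary and function names are hypothetical).

```python
# Execution times in seconds, taken from Table I.
times = {2: 504, 6: 1409, 7: 2047, 8: 4860, 9: 11405}


def percent_of(run, reference):
    """Execution time of `run` as a percentage of that of `reference`."""
    return 100.0 * times[run] / times[reference]


for run, reference in [(2, 6), (7, 8), (7, 9)]:
    print(f"run {run} vs run {reference}: {percent_of(run, reference):.0f}%")
```

It prints 36%, 42% and 18%, matching the text.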
CONCLUDING REMARKS


The preliminary results reported here show that adaptive multigrid techniques can substantially 
reduce the computer time required to make accurate hurricane track forecasts. In addition to 
ongoing testing of the existing model, we plan to investigate the following possible improvements. 
First, we plan to include the FAC method as discussed above. This should have the advantage of 
more precise conservation of vorticity, enstrophy, and kinetic energy at the grid interfaces. Second, 
we plan to construct a fully adaptive version of the model by using the Full Approximation Scheme 
(FAS) to produce estimates of the local truncation error to be used in an automatic grid refinement 
scheme (as proposed in ref. 8). Finally, we plan to test the model using real data, and compare its 
performance to that of models currently in operational use. 
Table I. Results of Model Runs

        Grid size(s)             Δt       Forecast error (km) at:      Execution
Run     N       h (km)         (min)     24 hr     48 hr     72 hr     time (sec)
 1      64      62.50            60       110       143        47          170
 2      64      62.50            60        11         8        17          504
        64      31.25            30
 3      96      41.67            30        53        12        25          799
 4      32     125.0            120        14        24        39          916
        32      62.50            60
        48      31.25            30
        64      15.62            15
 5      64      62.50            60         1         6        10        1,174
        64      31.25            30
        64      15.62            15
 6     128      31.25            30        11         8        19        1,409
 7      64      62.50            60         1         5         5        2,047
        64      31.25            30
        96      15.62            15
 8     192      20.83            20         1         3         5        4,860
 9     256      15.62            15         2         3         4       11,405
10     384      10.42            10         -         -         -       41,716
RELAXATION SCHEMES FOR CHEBYSHEV SPECTRAL MULTIGRID METHODS*


Yimin Kang and Scott R. Fulton 
Potsdam, NY



SUMMARY 


Two relaxation schemes for Chebyshev spectral multigrid methods are presented for elliptic 
equations with Dirichlet boundary conditions. The first scheme is a pointwise-preconditioned 
Richardson relaxation scheme and the second is a line relaxation scheme. The line relaxation 
scheme provides an efficient and relatively simple approach for solving two-dimensional spectral 
equations. Numerical examples and comparisons with other methods are given. 


INTRODUCTION 

For limited-area problems with general (non-periodic) boundary conditions, Chebyshev spectral 
methods give exponential convergence for smooth solutions. However, except in some very simple 
cases (e.g., one-dimensional constant-coefficient problems), Chebyshev approximations usually lead 
to full linear systems which cannot be solved efficiently by direct methods, and iterative methods 
must be used. Unfortunately, designing efficient iterative methods for discrete spectral equations 
has proven difficult, especially for problems with non-constant coefficients (ref. 1). Perhaps the 
most promising technique to date for solving spectral discretizations of elliptic problems is the 
spectral multigrid method (ref. 2, 3). However, the best relaxation schemes known today are 
complicated to apply. In this paper we introduce two simpler relaxation schemes and investigate 
their performance. 


*Work supported by the National Science Foundation under Grant No. ATM-9118966.

As prototype problems we consider one- and two-dimensional elliptic equations with Dirichlet boundary conditions on simple geometric domains. In one dimension we consider

$$-u''(x) = f(x), \quad |x| < 1; \qquad u(1) = a, \quad u(-1) = b. \qquad (1)$$

The two-dimensional prototype problem is

$$-\Delta u(x,y) = f(x,y), \quad |x|, |y| < 1; \qquad u(x,y) = g(x,y), \quad |x| = 1 \text{ or } |y| = 1. \qquad (2)$$

We discretize these problems by Chebyshev collocation. For example, for the two-dimensional problem (2), the solution $u(x,y)$ is approximated by a set of discrete values $u_{j,k}$ on the Chebyshev grid $\{(x_j, y_k) = (\cos(j\pi/N_x), \cos(k\pi/N_y)) \mid 0 \le j \le N_x,\ 0 \le k \le N_y\}$, with the requirement that problem (2) be satisfied on this grid, i.e.,

$$-\left[(u_{xx})_{j,k} + (u_{yy})_{j,k}\right] = f(x_j, y_k), \qquad 0 < j < N_x,\ 0 < k < N_y,$$
$$u_{j,k} = g(x_j, y_k), \qquad j = 0,\ j = N_x,\ k = 0,\ k = N_y, \qquad (3)$$

where $(u_{xx})_{j,k}$ and $(u_{yy})_{j,k}$ are the values on the Chebyshev grid of the second-order derivatives of the Chebyshev approximation $\sum_{m=0}^{N_x}\sum_{n=0}^{N_y} u_{mn} T_m(x) T_n(y)$ to $u(x,y)$. For simplicity, we will assume here that $N_x = N_y = N$; however, the codes described in this paper do not require this.

The discrete problem (3) can be expressed as a linear system

$$AU = F. \qquad (4)$$

Unfortunately, the matrix $A$ produced by Chebyshev collocation is full and nonsymmetric. For two-dimensional problems, direct methods (like Gaussian elimination) would require $O(N^6)$ operations for factorization and $O(N^4)$ for the subsequent solution, which is far too much work to be practical. Thus, iterative methods must be used.
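The one-dimensional problem (1) has the same structure as (3)-(4), so it is a convenient place to see the full, nonsymmetric collocation matrix. The pure-Python sketch below builds that matrix from the standard closed-form Chebyshev first-derivative matrix on the points $x_j = \cos(j\pi/N)$, squaring it to get the second derivative. It then solves the system with dense Gaussian elimination. All function names are our own hypothetical choices and are not the authors' FORTRAN routines. The dense solve already costs $O(N^3)$ in one dimension; the two-dimensional analogue is the $O(N^6)$ factorization mentioned above.

```python
import math


def cheb_points(n):
    """Chebyshev-Gauss-Lobatto points x_j = cos(j*pi/n), j = 0..n (x_0 = 1, x_n = -1)."""
    return [math.cos(j * math.pi / n) for j in range(n + 1)]


def cheb_diff_matrix(n):
    """Standard first-derivative collocation matrix on cheb_points(n)."""
    x = cheb_points(n)
    c = [2.0 if j in (0, n) else 1.0 for j in range(n + 1)]
    d = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(n + 1):
            if i != j:
                d[i][j] = (c[i] / c[j]) * (-1) ** (i + j) / (x[i] - x[j])
        # Diagonal chosen so that each row maps constants to zero.
        d[i][i] = -sum(d[i][j] for j in range(n + 1) if j != i)
    return d


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def collocation_matrix_1d(n):
    """Matrix A of (4) for problem (1): -D^2 on interior rows, identity rows
    for the Dirichlet conditions at x_0 = 1 and x_n = -1."""
    d = cheb_diff_matrix(n)
    a = [[-v for v in row] for row in matmul(d, d)]
    for i in (0, n):
        a[i] = [1.0 if j == i else 0.0 for j in range(n + 1)]
    return a


def solve_dense(a, rhs):
    """Gaussian elimination with partial pivoting: O(n^3) work."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    u = [0.0] * n
    for i in range(n - 1, -1, -1):
        u[i] = (m[i][n] - sum(m[i][j] * u[j] for j in range(i + 1, n))) / m[i][i]
    return u


# -u'' = 2 with u(1) = a = 0, u(-1) = b = 0 has the exact solution 1 - x^2,
# which a degree-N collocation solution reproduces up to round-off.
N = 8
x = cheb_points(N)
F = [0.0] + [2.0] * (N - 1) + [0.0]
U = solve_dense(collocation_matrix_1d(N), F)
print(max(abs(U[j] - (1 - x[j] ** 2)) for j in range(N + 1)))
```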


THE POINTWISE PRECONDITIONED RICHARDSON RELAXATION SCHEME 


The most efficient method available today for solving (4) and its generalizations to other 
elliptic problems is the spectral multigrid method of Zang et al. (ref. 2, 3), which employs finite- 
difference preconditioned Richardson iteration as the relaxation scheme in a multigrid context. 
Preconditioned Richardson relaxation for (4) takes the form 

$$V \leftarrow V + \omega H (F - AV), \qquad (5)$$

where $V$ is the current approximation to $U$, $\omega$ is a relaxation parameter, and $H$ is the preconditioner. The criteria for choosing a preconditioner $H$ are:

• H should give fast multigrid convergence, 

• H should be easy and cheap to generate or apply. 

The finite-difference preconditioning of Zang et al. (ref. 2, 3) gives fast convergence, but applying it 
requires solving (or nearly solving) a finite-difference discretization on the nonuniform Chebyshev 
grid. This procedure is complicated and expensive. Are there alternatives which are simpler 
and still effective? Achi Brandt (personal communication, 1983) has suggested that pointwise 
preconditioning based on the (variable) Chebyshev mesh spacing might work well. In this section, 
we investigate the performance of this simple preconditioner when applied to the problem (4). 
The One-Dimensional Case


Formulation 

As an analogue of Gauss-Seidel relaxation for a finite-difference method, the pointwise preconditioning for the Chebyshev discretization takes the form

$$v_j \leftarrow v_j + \omega \frac{h_j^2}{2} r_j, \qquad (6)$$

where $h_j = (x_{j-1} - x_{j+1})/2$ is the effective grid size at the point $x_j$, $r_j$ is the residual $R = F - AV$ at $x_j$, and $\omega$ is a relaxation parameter to be chosen to accelerate the convergence. Note that (6) is equivalent to choosing the preconditioning matrix $H$ in (5) as the diagonal matrix

$$H = \mathrm{diag}\left(1, \frac{h_1^2}{2}, \ldots, \frac{h_{N-1}^2}{2}, 1\right). \qquad (7)$$
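Because $H$ in (7) is diagonal, one sweep of (5) costs a single residual evaluation plus one multiplication per grid point. The sketch below computes $h_j$ and the diagonal of $H$, then performs one simultaneous update of (5). It is a minimal pure-Python illustration, and the helper names are hypothetical. The matrix $A$ is passed in as an argument, so any collocation matrix can be used. The check uses a tiny three-point matrix only because its answer is easy to verify by hand. The code also confirms the closed form $h_j = \sin(j\pi/N)\sin(\pi/N)$, which follows from $\cos(a-b) - \cos(a+b) = 2\sin a \sin b$. This closed form shows that the effective spacing near the boundary is $O(N^{-2})$, while near the centre it is $O(N^{-1})$.

```python
import math


def effective_spacing(n):
    """h_j = (x_{j-1} - x_{j+1}) / 2 for interior Chebyshev points x_j = cos(j*pi/n)."""
    x = [math.cos(j * math.pi / n) for j in range(n + 1)]
    return [(x[j - 1] - x[j + 1]) / 2 for j in range(1, n)]


def pointwise_preconditioner(n):
    """Diagonal of H in (7): 1 on the two boundary rows, h_j^2 / 2 inside."""
    return [1.0] + [h * h / 2 for h in effective_spacing(n)] + [1.0]


def richardson_sweep(a, hdiag, v, f, omega):
    """One sweep of (5), V <- V + omega * H (F - A V), with diagonal H.
    Every component uses the old residual (a simultaneous update)."""
    r = [f[i] - sum(a[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]
    return [v[i] + omega * hdiag[i] * r[i] for i in range(len(v))]


n = 16
h = effective_spacing(n)
closed_form = [math.sin(j * math.pi / n) * math.sin(math.pi / n) for j in range(1, n)]
print(max(abs(p - q) for p, q in zip(h, closed_form)))  # round-off level
print(h[0], h[n // 2 - 1])  # smallest (next to the boundary) vs largest (centre)
```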


Analysis 

The evolution of the error $E = V - U$ in the Richardson relaxation (5) is described by

$$E \leftarrow (I - \omega H A) E. \qquad (8)$$

Therefore, the convergence factor for (5) on a single grid is

$$\sigma_{SG} = \rho(I - \omega H A),$$

where $\rho$ denotes the spectral radius. Likewise, the multigrid smoothing factor for (5), when used as a smoother in a multigrid method (e.g., ref. 4), is

$$\mu = \rho\big(G(I - \omega H A)\big), \qquad (9)$$

where $G$ represents the perfect coarse-grid correction, i.e., it sets all low modes of the error to zero.


For the simple preconditioning (7), our numerical computations show that the eigenvalues of the matrix $HA$ are all positive real numbers. The maximum eigenvalue is $\lambda_{\max} \approx 5.0$, the middle one is $\lambda_{\mathrm{mid}} \approx 1.5$, and the minimum is $\lambda_{\min} = O(N^{-2})$. The formulas of Zang et al. (refs. 2, 3) then give a good approximation to the optimal $\omega$ and $\mu$, namely,

$$\omega = \frac{2}{\lambda_{\max} + \lambda_{\mathrm{mid}}} \approx 0.325, \qquad \mu = \frac{\lambda_{\max} - \lambda_{\mathrm{mid}}}{\lambda_{\max} + \lambda_{\mathrm{mid}}} \approx 0.6. \qquad (10)$$

Indeed, computing the smoothing factor directly from (9) using $\omega = 0.325$, we find that $\mu < 0.6$ for all $N \le 512$.


To take into account the effects of grid transfers (omitted in the smoothing analysis above), we use the following two-grid analysis. The evolution of the error $E$ in one two-grid V$(n_1, n_2)$-cycle is described by the matrix

$$T = (I - \omega H A_f)^{n_2}\,(I - P A_c^{-1} R A_f)\,(I - \omega H A_f)^{n_1}. \qquad (11)$$

Here $n_1$ and $n_2$ specify the number of relaxation sweeps before and after the coarse-grid correction, respectively. $R$ represents the fine-to-coarse grid transfer (we use injection), $P$ represents the coarse-to-fine grid transfer (we use Chebyshev interpolation), and $A_f$ and $A_c$ represent the discrete operator matrix in (4) on the fine and coarse grids, respectively. Note that (11) assumes that the coarse-grid problem is solved exactly.
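The grid transfers $R$ and $P$ in (11) are simple when the coarse grid has $N/2$ intervals. Coarse point $j$ coincides with fine point $2j$, because $\cos(j\pi/(N/2)) = \cos(2j\pi/N)$. Injection therefore just keeps every other value. Chebyshev interpolation can be evaluated with the barycentric formula for Chebyshev-Lobatto points, whose weights are $(-1)^j$, halved at the two end points. The sketch below is our own illustration, not the authors' code, and it assumes this $N \to N/2$ coarsening; the function names are hypothetical. It checks that injection followed by interpolation reproduces any polynomial of degree at most $N/2$.

```python
import math


def cheb_points(n):
    return [math.cos(j * math.pi / n) for j in range(n + 1)]


def inject(fine_values):
    """R: keep the fine-grid values at x_{2j}, which are exactly the coarse-grid points."""
    return fine_values[::2]


def cheb_interpolate(coarse_values, targets):
    """P: evaluate the Chebyshev interpolant through `coarse_values` (given at the
    m + 1 Chebyshev-Lobatto points) at each target point, via the barycentric formula."""
    m = len(coarse_values) - 1
    xc = cheb_points(m)
    w = [(-1) ** j * (0.5 if j in (0, m) else 1.0) for j in range(m + 1)]
    out = []
    for t in targets:
        node = next((j for j in range(m + 1) if abs(t - xc[j]) < 1e-14), None)
        if node is not None:  # target coincides with a coarse point
            out.append(coarse_values[node])
            continue
        terms = [w[j] / (t - xc[j]) for j in range(m + 1)]
        out.append(sum(tj * fj for tj, fj in zip(terms, coarse_values)) / sum(terms))
    return out


N = 8
xf = cheb_points(N)
p = [xi ** 4 - 3 * xi + 1 for xi in xf]  # degree 4 = N/2, so P R reproduces it
back = cheb_interpolate(inject(p), xf)
print(max(abs(b - q) for b, q in zip(back, p)))  # round-off level
```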

We computed the two-grid convergence factor $\sigma_{TG} = \rho(T)$ for $N \le 512$ using different values of $\omega$. The results show that $\omega = 0.325$ again gives the optimal convergence factor, or very close to it. Using that constant value, we find that the smoothing factor per sweep $\mu_s = (\sigma_{TG})^{1/(n_1+n_2)}$ satisfies

$$0.5 < \mu_s < 0.6$$

for all $N \le 512$. A similar analysis for the one-dimensional Helmholtz problem

$$\lambda u(x) - u''(x) = f(x) \qquad (12)$$

shows that, with various choices of $\lambda$ and boundary conditions (Dirichlet, Neumann, and mixed), an appropriate pointwise preconditioner also yields a smoothing factor per sweep $\mu_s < 0.6$.

We have developed FORTRAN-77 routines to implement the Chebyshev multigrid method using the pointwise preconditioner described above. The code has been used to solve problem (12) with various choices of $u(x)$, $\lambda$, and boundary conditions. The observed convergence factor per sweep $\mu_s$ is smaller than 0.60 for all cases tested, in agreement with the analysis presented above.


The Two-Dimensional Case 


Formulation 

We note that Gauss-Seidel relaxation for the second-order centered finite-difference approximation to (2) can be written as

$$u_{j,k} \leftarrow u_{j,k} + \frac{r_{j,k}}{2/h_x^2 + 2/h_y^2},$$

where $r_{j,k}$ is the finite-difference residual and $h_x$, $h_y$ are the mesh spacings in $x$ and $y$. A natural analogue for the Chebyshev collocation discretization (3) is

$$v_{j,k} \leftarrow v_{j,k} + \omega\,\frac{r_{j,k}}{2/h_j^2 + 2/h_k^2}, \qquad (13)$$

where $h_j$ and $h_k$ are the grid sizes at the point $(x_j, y_k)$. Here $r_{j,k} := f_{j,k} - \left[-\left((v_{xx})_{j,k} + (v_{yy})_{j,k}\right)\right]$ is the residual of the Chebyshev discretization, and $\omega$ is a relaxation parameter to be chosen to accelerate the convergence. Clearly, the iteration (13) is a special case of the Richardson iteration (5), with a diagonal preconditioner $H$ whose diagonal entries are $H_{jk,jk} = (2/h_j^2 + 2/h_k^2)^{-1}$. This preconditioner is easy and fast to apply. Does it give fast convergence? Unfortunately, the following analysis shows that the answer is no.
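The diagonal preconditioner in (13) can be tabulated once per grid. The following is a minimal sketch that assumes the same effective grid sizes $h_j = (x_{j-1} - x_{j+1})/2$ used in one dimension; the function names are hypothetical. Only interior points are stored, since boundary values are fixed by the Dirichlet data. The residual $r_{j,k}$ is taken as an input, so the Chebyshev evaluation of $v_{xx} + v_{yy}$ is left out.

```python
import math


def effective_spacing(n):
    x = [math.cos(j * math.pi / n) for j in range(n + 1)]
    return [(x[j - 1] - x[j + 1]) / 2 for j in range(1, n)]


def pointwise_preconditioner_2d(nx, ny):
    """Interior diagonal entries H_{jk,jk} = (2/h_j^2 + 2/h_k^2)^(-1) of (13).
    Entry [j-1][k-1] belongs to the grid point (x_j, y_k)."""
    hx, hy = effective_spacing(nx), effective_spacing(ny)
    return [[1.0 / (2.0 / hj ** 2 + 2.0 / hk ** 2) for hk in hy] for hj in hx]


def pointwise_sweep_2d(v, residual, hdiag, omega):
    """Update (13) at interior points. `residual[j][k]` must already hold
    r_{j,k} = f_{j,k} - (-(v_xx + v_yy)) from the Chebyshev discretization."""
    out = [row[:] for row in v]
    for j in range(1, len(v) - 1):
        for k in range(1, len(v[0]) - 1):
            out[j][k] += omega * hdiag[j - 1][k - 1] * residual[j][k]
    return out


H = pointwise_preconditioner_2d(16, 16)
print(H[0][0], H[7][7])  # point next to a corner vs the centre point
```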


Analysis 

Computational results indicate that the eigenvalues of the matrix $HA$ are all positive real numbers. Again, good approximations to the optimal $\omega$ and $\mu$ can be obtained by

$$\omega = \frac{2}{\lambda_{\max} + \lambda_{\mathrm{qua}}}, \qquad \mu = \frac{\lambda_{\max} - \lambda_{\mathrm{qua}}}{\lambda_{\max} + \lambda_{\mathrm{qua}}}, \qquad (14)$$

where $\lambda_{\max}$ is the maximum eigenvalue and $\lambda_{\mathrm{qua}}$ is the quarter eigenvalue (ref. 1). More precise values of the optimal $\omega$ and $\mu$ can be obtained by computing the spectral radius $\rho(G(I - \omega H A))$ directly for different choices of $\omega$ and comparing the results. For $N \le 32$, Table I lists the eigenvalues $\lambda_{\max}$ and $\lambda_{\mathrm{qua}}$, the values of $\omega$ and $\mu$ computed by (14), and the optimal $\omega$ and $\mu$. Since $\mu$ is large and increases with $N$, these results suggest that the pointwise preconditioner (13) will not be a good multigrid smoother.
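Formula (14) is cheap to evaluate once the two eigenvalues are known. The helper below is a direct transcription; its name is our own. As a consistency check, it reproduces the "By (14)" columns of Table I to the two printed decimals. With $\mu$ squared, as described for the line relaxation tables, the same formula matches the Table II entries to within about 0.002.

```python
def relaxation_parameters(lam_max, lam_qua):
    """Approximately optimal (omega, mu) from formula (14)."""
    omega = 2.0 / (lam_max + lam_qua)
    mu = (lam_max - lam_qua) / (lam_max + lam_qua)
    return omega, mu


# Eigenvalue pairs (lambda_max, lambda_qua) of HA taken from Table I.
table_1 = {4: (3.00, 1.83), 8: (4.10, 1.26), 16: (4.57, 0.95), 32: (4.76, 0.78)}
for n, (lmax, lqua) in table_1.items():
    omega, mu = relaxation_parameters(lmax, lqua)
    print(f"N={n:2d}  omega={omega:.2f}  mu={mu:.2f}")
```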

Table I also lists the two-grid smoothing factors per sweep $\mu_s = (\rho(T))^{1/(n_1+n_2)}$ computed from the matrices in (11) for $N \le 32$ using $\omega = 0.36$. These results again show that the pointwise preconditioning (13) does not give fast convergence.

We have implemented the pointwise preconditioning (13) in a multigrid solver written in 
Fortran 77. Computational results from a number of test cases confirm the above analysis: we 
conclude that the pointwise preconditioning does not give fast convergence. 


Table I. Multigrid Analysis of Two-Dimensional Pointwise Preconditioning

        Eigenvalues of HA      By (14)          By computation
  N     λ_max    λ_qua         ω       μ        ω_opt    μ       μ_s
  4     3.00     1.83          0.41    0.24     0.35     0.28    0.51
  8     4.10     1.26          0.37    0.53     0.35     0.52    0.68
 16     4.57     0.95          0.36    0.66     0.36     0.75    0.80
 32     4.76     0.78          0.36    0.72     0.36     0.82    0.88
THE LINE RELAXATION METHOD


The poor performance of pointwise preconditioning in two dimensions can be understood in 
terms of the anisotropy introduced by the nonuniform Chebyshev collocation grid. Since the mesh 
spacing varies with x and y, at any given point (x, y) the coupling in the discrete operator in (3)
may be stronger in x or in y. In finite-difference multigrid methods, point relaxation performs 
poorly in such anisotropic cases, and the cure is to use alternating direction line relaxation. Thus, 
it is reasonable to try an analogous approach for the Chebyshev discretization. 
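The anisotropy can be quantified with the effective grid sizes used in (13). In a five-point-like stencil, the x- and y-couplings at $(x_j, y_k)$ scale like $1/h_j^2$ and $1/h_k^2$, so their ratio is $(h_k/h_j)^2$. The short sketch below is our own illustration rather than part of the paper, and its function name is hypothetical. It computes the largest such ratio on an $N \times N$ Chebyshev grid. Since the largest and smallest $h_j$ differ by the factor $1/\sin(\pi/N)$, the ratio is $1/\sin^2(\pi/N) \approx N^2/\pi^2$, so the anisotropy grows with the resolution.

```python
import math


def effective_spacing(n):
    x = [math.cos(j * math.pi / n) for j in range(n + 1)]
    return [(x[j - 1] - x[j + 1]) / 2 for j in range(1, n)]


def max_coupling_ratio(n):
    """Largest (h_k / h_j)^2 over interior points of an n x n Chebyshev grid,
    i.e. how much stronger one direction's coupling can be than the other's."""
    h = effective_spacing(n)
    return (max(h) / min(h)) ** 2


for n in (8, 16, 32, 64):
    print(n, round(max_coupling_ratio(n), 1), round(1 / math.sin(math.pi / n) ** 2, 1))
```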


Formulation 

To formulate the line relaxation method, we express the discrete problem (3) in the matrix form

$$(\mathcal{H} + \mathcal{V})U = F, \qquad (15)$$

where $\mathcal{H}$ and $\mathcal{V}$ correspond to the horizontal part ($-\partial^2/\partial x^2$) and the vertical part ($-\partial^2/\partial y^2$) of the Laplacian operator, respectively. Starting from an approximation $V^{\mathrm{old}}$ to the solution $U$, one sweep of (alternating-direction) line relaxation based on (15) consists of the following two parts:

1. Sweep along the x-direction. On each grid line parallel to the x-axis, use the values of $V^{\mathrm{old}}$ except those on the current line, and solve (15) for the values on the current line. This can be expressed in matrix form as

$$(\mathcal{H} + \mathcal{V}_d)V^{\mathrm{mid}} = F - \mathcal{V}_o V^{\mathrm{old}}, \qquad (16)$$

where $\mathcal{V}_d$ and $\mathcal{V}_o$ denote the diagonal and off-diagonal parts of the matrix $\mathcal{V}$, respectively. Note that the entries of $\mathcal{V}_d$ are known (ref. 1) and that $\mathcal{V}_d$ is constant on each grid line parallel to the x-axis. Thus, the system (16) decouples into $(N-1)$ one-dimensional discrete problems. Each is a Chebyshev collocation approximation to a Helmholtz equation on an interior grid line parallel to the x-axis, and the x-direction sweep consists of solving these equations.

2. Sweep along the y-direction. The y-direction sweep is basically the same as the x-direction sweep, except that we now work on grid lines parallel to the y-axis and use the values of $V^{\mathrm{mid}}$ instead of $V^{\mathrm{old}}$. The equation we need to solve is

$$(\mathcal{H}_d + \mathcal{V})V^{\mathrm{new}} = F - \mathcal{H}_o V^{\mathrm{mid}}, \qquad (17)$$

where $\mathcal{H}_d$ and $\mathcal{H}_o$ are the diagonal and off-diagonal parts of $\mathcal{H}$. As in the x-direction sweep, the two-dimensional problem (17) is solved by solving $(N-1)$ one-dimensional Helmholtz equations.

It turns out that, as it stands, the line relaxation (16)-(17) is not a good multigrid smoother; however, this can be fixed as follows. Let $C^{\mathrm{mid}} = V^{\mathrm{mid}} - V^{\mathrm{old}}$ and $C^{\mathrm{new}} = V^{\mathrm{new}} - V^{\mathrm{mid}}$ denote the corrections to $V^{\mathrm{old}}$ and $V^{\mathrm{mid}}$. Let $R^{\mathrm{old}} = F - AV^{\mathrm{old}}$ and $R^{\mathrm{mid}} = F - AV^{\mathrm{mid}}$ denote the residuals of $V^{\mathrm{old}}$ and $V^{\mathrm{mid}}$, respectively, with $A = \mathcal{H} + \mathcal{V}$. Rewriting equations (16) and (17) as correction equations and introducing a relaxation parameter $\omega$ (to be determined by analysis to accelerate the convergence), we obtain

$$(\mathcal{H} + \mathcal{V}_d)C^{\mathrm{mid}} = \omega R^{\mathrm{old}}, \qquad (\mathcal{H}_d + \mathcal{V})C^{\mathrm{new}} = \omega R^{\mathrm{mid}}. \qquad (18)$$

We refer to (18) as the collocation version of the line relaxation method.

It is not practical to implement the collocation version because there are no fast solvers available 
for the collocation approximations, even for one-dimensional problems. However, in the multigrid 
context, a relaxation scheme functions as a smoother rather than a solver: instead of solving 
each problem exactly, we only need to smooth out the error, i.e., reduce high modes in the error. 
Therefore, it is reasonable to replace the one-dimensional problems in (18) by approximate versions 
which can be solved efficiently. We consider two alternatives as follows. 

In the first, we replace the collocation discretizations of the one-dimensional Helmholtz equations in (18) by tau discretizations. Tau approximations have the same exponential convergence as the collocation method, but can be solved directly in $O(N \log N)$ operations. This leads to the tau version of the line relaxation method, and the total work of one x- or y-direction sweep is $O(N^2 \log N)$. As we will see below, this tau version turns out to be an efficient multigrid smoother.

In the second, we replace the collocation discretizations of the one-dimensional Helmholtz equations in (18) by finite-difference discretizations. This leads to the finite-difference version of the line relaxation method, which has two obvious advantages over the tau version. First, it is faster, because it eliminates the transforms required in the tau version and thus reduces the operation count for solving each one-dimensional problem from $O(N \log N)$ to $O(N)$. Second, it can be extended to more general problems, e.g., problems with variable coefficients. As we will see below, this finite-difference version also turns out to be an efficient multigrid smoother, even in the case of variable coefficients.
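The $O(N)$ line solve of the finite-difference version can be sketched as follows. Each interior line $y = y_k$ of the x-direction sweep gives a tridiagonal system $(\mathcal{H}^{fd} + \mathcal{V}_d)C^{\mathrm{mid}} = \omega R^{\mathrm{old}}$, and the Thomas algorithm solves it in $O(N)$ work. This is a minimal pure-Python sketch under several assumptions of ours. First, we use the standard three-point stencil for nonuniform grids; the text only says "finite-difference discretization". Second, corrections are taken to be zero on the boundary. Third, the residual and the constant $\mathcal{V}_d$ value for each line are supplied by the caller. All names are hypothetical.

```python
import math


def cheb_points(n):
    return [math.cos(j * math.pi / n) for j in range(n + 1)]


def thomas(lower, diag, upper, rhs):
    """Solve a tridiagonal system in O(n) without pivoting (adequate for these
    diagonally dominant matrices). lower[0] and upper[-1] must be 0."""
    n = len(diag)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * cp[i - 1]
        cp[i] = upper[i] / denom
        dp[i] = (rhs[i] - lower[i] * dp[i - 1]) / denom
    sol = [0.0] * n
    sol[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        sol[i] = dp[i] - cp[i] * sol[i + 1]
    return sol


def fd_line_operator(x, shift):
    """Assumed three-point nonuniform approximation of -d^2/dx^2 + shift at the
    interior points of the (decreasing) Chebyshev grid x, for zero end values."""
    n = len(x) - 1
    lower, diag, upper = [], [], []
    for j in range(1, n):
        hm = x[j - 1] - x[j]  # distance to the neighbour x_{j-1} (to the right)
        hp = x[j] - x[j + 1]  # distance to the neighbour x_{j+1} (to the left)
        lower.append(0.0 if j == 1 else -2.0 / (hm * (hm + hp)))
        diag.append(2.0 / (hm * hp) + shift)
        upper.append(0.0 if j == n - 1 else -2.0 / (hp * (hm + hp)))
    return lower, diag, upper


def x_sweep_correction(residual, x, vd, omega):
    """C^mid from (H_fd + V_d) C^mid = omega R^old, one grid line at a time.
    residual[k][j] is R^old at (x_j, y_k); vd[k] is the diagonal entry of the
    vertical collocation operator, which is constant along line k."""
    n = len(x) - 1
    corr = [[0.0] * (n + 1) for _ in residual]
    for k in range(1, len(residual) - 1):  # interior lines y = y_k
        lower, diag, upper = fd_line_operator(x, vd[k])
        rhs = [omega * residual[k][j] for j in range(1, n)]
        corr[k][1:n] = thomas(lower, diag, upper, rhs)
    return corr


# Check: with vd = 0, omega = 1 and residual 2 on a line, the correction solves
# -c'' = 2 with zero ends; the three-point stencil is exact for quadratics,
# so c_j = 1 - x_j^2 up to round-off.
x = cheb_points(8)
res = [[0.0] * 9, [2.0] * 9, [0.0] * 9]
c = x_sweep_correction(res, x, [0.0, 0.0, 0.0], 1.0)
print(max(abs(c[1][j] - (1 - x[j] ** 2)) for j in range(1, 8)))
```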


Analysis 


As in the case of the pointwise preconditioned Richardson relaxation, we can analyze the performance of the line relaxation methods described above by computing the eigenvalues of the corresponding iteration matrices. Because the tau version cannot be expressed in matrix form like (18), we will only do the analysis for the collocation and finite-difference versions. Since the tau and collocation versions are nearly the same, the analysis for the collocation version should give a good prediction of the performance of the tau version. In this section, we give details of the analysis for the finite-difference version and only list results for the collocation version.


Smoothing Analysis 

For the finite-difference version of the line relaxation iteration, the error evolution is described by

$$E^{\mathrm{mid}} \leftarrow \left[I - \omega(\mathcal{H}^{fd} + \mathcal{V}_d)^{-1}(\mathcal{H} + \mathcal{V})\right]E^{\mathrm{old}}, \qquad (19)$$

$$E^{\mathrm{new}} \leftarrow \left[I - \omega(\mathcal{H}_d + \mathcal{V}^{fd})^{-1}(\mathcal{H} + \mathcal{V})\right]E^{\mathrm{mid}}, \qquad (20)$$

where $\mathcal{H}^{fd}$ and $\mathcal{V}^{fd}$ are the finite-difference analogues of the collocation discretization matrices $\mathcal{H}$ and $\mathcal{V}$, respectively. Therefore, the error evolution matrix for one relaxation is

$$S = \left[I - \omega(\mathcal{H}_d + \mathcal{V}^{fd})^{-1}(\mathcal{H} + \mathcal{V})\right]\left[I - \omega(\mathcal{H}^{fd} + \mathcal{V}_d)^{-1}(\mathcal{H} + \mathcal{V})\right]. \qquad (21)$$

The matrices $S_{\mathcal{H}} = (\mathcal{H}^{fd} + \mathcal{V}_d)^{-1}(\mathcal{H} + \mathcal{V})$ and $S_{\mathcal{V}} = (\mathcal{H}_d + \mathcal{V}^{fd})^{-1}(\mathcal{H} + \mathcal{V})$ have the same eigenvalues, since $x$ and $y$ can be interchanged in the Laplacian operator. We can therefore focus on just the x-direction sweep (19). The eigenvalues of $S_{\mathcal{H}}$ are all positive real numbers, so we can use formulas (14) to obtain approximate values of $\omega$ and $\mu$, squaring $\mu$ to represent the effect of both the x and y sweeps. These values are listed in Table II for $N \le 32$. The table also gives the optimal relaxation parameter $\omega_{\mathrm{opt}}$ and the corresponding multigrid smoothing factor $\mu = \rho(GS)$, both computed directly. These results suggest that for large values of the truncation number $N$, $\omega_{\mathrm{opt}} \approx 0.6$ and $\mu < 0.5$, independent of the grid size. Corresponding results for the collocation version are listed in Table III.


Multigrid Analysis 

Consider a multigrid V$(n_1, n_2)$-cycle that uses zeros as initial guesses on all coarse grids. This is a natural choice, because the coarse-grid solution is a correction to the solution on the next finer grid. We can then write out the error evolution matrix explicitly as

$$M = S^{n_2}\left[I - PGR(\mathcal{H} + \mathcal{V})\right]S^{n_1}. \qquad (22)$$

This represents a procedure of $n_1$ pre-relaxations ($S^{n_1}$), followed by a coarse-grid correction ($I - PGR(\mathcal{H} + \mathcal{V})$) and then $n_2$ post-relaxations ($S^{n_2}$). The matrix $S$ is the error evolution matrix of one relaxation on the finest grid, defined in (21). In the coarse-grid correction, $R$ represents the fine-to-coarse grid transfer (we use injection) and $P$ represents the coarse-to-fine grid transfer (we use Chebyshev interpolation). The matrix $G$ is defined on the next coarser grid as follows. On the coarsest grid, $G = (\mathcal{H} + \mathcal{V})^{-1}$, which means the coarsest-grid problem is solved exactly; otherwise,

$$G = [I - M](\mathcal{H} + \mathcal{V})^{-1}, \qquad (23)$$

which represents a multigrid solution procedure on that grid. Note that (23) is actually a recursive definition, since the matrix $M$ in (23) includes another matrix $G$ on the next coarser grid.
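Because (22)-(23) are defined recursively, they translate almost line by line into code. The sketch below uses small dense matrices in pure Python purely for illustration. The level layout (dictionaries with keys "A", "S", "R", "P") and all function names are our own assumptions. A real analysis like the one behind Tables II and III would supply the collocation operator, the line relaxation error matrix (21), injection, and Chebyshev interpolation. Here a toy two-grid example keeps the result checkable by hand: with $S = I$, $M$ reduces to the coarse-grid correction alone.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matsub(a, b):
    return [[p - q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def matpow(a, p):
    out = identity(len(a))
    for _ in range(p):
        out = matmul(out, a)
    return out


def inverse(a):
    """Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + identity(n)[i] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]


def error_matrix(levels, i, n1, n2):
    """M on level i, eq. (22): S^n2 [I - P G R A] S^n1. levels[i] holds 'A' and 'S'
    for grid i (0 = finest) and, except on the coarsest grid, 'R' (to grid i+1)
    and 'P' (from grid i+1)."""
    lv = levels[i]
    g = coarse_solver(levels, i + 1, n1, n2)
    cgc = matsub(identity(len(lv["A"])),
                 matmul(lv["P"], matmul(g, matmul(lv["R"], lv["A"]))))
    return matmul(matpow(lv["S"], n2), matmul(cgc, matpow(lv["S"], n1)))


def coarse_solver(levels, i, n1, n2):
    """G on level i: exact inverse on the coarsest grid, else (I - M) A^{-1}, eq. (23)."""
    a_inv = inverse(levels[i]["A"])
    if i == len(levels) - 1:
        return a_inv
    return matmul(matsub(identity(len(a_inv)), error_matrix(levels, i, n1, n2)), a_inv)


levels = [
    {"A": [[2.0, -1.0], [-1.0, 2.0]], "S": identity(2),
     "R": [[0.5, 0.5]], "P": [[1.0], [1.0]]},
    {"A": [[1.0]], "S": [[1.0]]},
]
print(error_matrix(levels, 0, n1=2, n2=1))  # [[0.5, -0.5], [-0.5, 0.5]]
```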

Tables II and III also list computed values of the smoothing factor per sweep $\mu_s = (\rho(M))^{1/(n_1+n_2)}$ for the case $\omega = 0.6$, $n_1 = 2$, and $n_2 = 1$. These results suggest that the smoothing factor of the line relaxation method is less than 0.5, independent of the grid size. We could also use Chebyshev restriction instead of injection for the fine-to-coarse grid transfer $R$, but our numerical experience shows very little difference between these two choices.
Table II. Analysis of the Finite-Difference Version

        Eigenvalues of S_ℋ      By (14)            By computation
  N     λ_max     λ_qua         ω        μ         ω_opt    μ        μ_s
  4     1.995     1.000         0.669    0.110     0.58     0.110    0.181
  8     2.513     1.000         0.569    0.186     0.60     0.168    0.293
 16     2.780     0.995         0.530    0.224     0.60     0.271    0.364
 32     2.898     0.815         0.539    0.315     0.60     0.366    0.421


Table III. Analysis of the Collocation Version

        Eigenvalues of S_ℋ      By (14)            By computation
  N     λ_max     λ_qua         ω        μ         ω_opt    μ        μ_s
  4     1.651     1.000         0.754    0.060     0.68     0.120    0.302
  8     2.322     0.922         0.616    0.186     0.60     0.216    0.328
 16     2.701     0.810         0.570    0.290     0.58     0.326    0.380
 32     2.869     0.700         0.560    0.370     0.60     0.410    0.428


Computational Results 


We have implemented the tau and finite-difference versions of the line relaxation scheme described above in a Chebyshev collocation multigrid solver for the two-dimensional Helmholtz problem

$$\lambda u(x,y) - \Delta u(x,y) = f(x,y), \quad |x|, |y| < 1; \qquad u(x,y) = g(x,y), \quad |x| = 1 \text{ or } |y| = 1,$$

with various choices of $f$, $g$, and $\lambda$. For both versions, the observed convergence factor per sweep is less than 0.5 for all cases tested, in agreement with the analysis above. The finite-difference version turns out to have slightly better convergence factors than the tau version, but the difference is minor.
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Comparisons with Other Methods 


In this section we compare the line relaxation spectral multigrid method developed above to two other methods for solving the two-dimensional prototype problem (2). The first is a conventional finite-difference multigrid method; the second is a matrix diagonalization technique. We do not compare with the method of Zang et al. (ref. 3), since the details presented in that paper were not sufficient to allow programming the method. All computations are done on a SUN SPARCstation 2 using double precision; the machine round-off error is about $2.22 \times 10^{-16}$.


Conventional Finite-Difference Multigrid Method 

The finite-difference discretization is the usual second-order five-point scheme on a uniform 
grid. The finite-difference multigrid method uses Gauss-Seidel (Red-Black) iteration as a relaxation 
scheme, the fine-to-coarse grid transfer is half-injection, the coarse-to-fine grid transfer is bilinear 
interpolation, and the multigrid V-cycle algorithm is used. 

According to our timings, the average execution time of one V(2,1)-cycle of the finite-difference multigrid method is approximately $(0.56 \times 10^{-4}) N^2$ seconds. For the line relaxation spectral multigrid method it is $(0.21 \times 10^{-3}) N^2 \log_2 N$ seconds. Therefore, for the same grid sizes, one V(2,1)-cycle of the finite-difference multigrid method is approximately $3.75 \log_2 N$ times faster than one V(2,1)-cycle of the line relaxation spectral multigrid method.
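The factor $3.75 \log_2 N$ is simply the ratio of the two fitted timing formulas, since $0.21 \times 10^{-3} / (0.56 \times 10^{-4}) = 3.75$. The short sketch below evaluates the fits; the function names are our own. The fits describe the authors' SPARCstation 2 runs, so the absolute seconds are specific to that machine.

```python
import math


def fd_vcycle_seconds(n):
    """Fitted time of one V(2,1)-cycle of the finite-difference multigrid method."""
    return 0.56e-4 * n ** 2


def lr_smg_vcycle_seconds(n):
    """Fitted time of one V(2,1)-cycle of line relaxation spectral multigrid."""
    return 0.21e-3 * n ** 2 * math.log2(n)


for n in (16, 32, 64, 128):
    ratio = lr_smg_vcycle_seconds(n) / fd_vcycle_seconds(n)
    print(n, round(ratio, 2), round(ratio / math.log2(n), 2))  # last column is 3.75
```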

However, spectral methods have exponential convergence, while finite-difference methods have only polynomial convergence. When high accuracy is required, finite-difference multigrid methods must therefore use much larger grids than spectral methods. The result is that the line relaxation spectral multigrid method is faster than the finite-difference method when high accuracy is required. As a specific example, consider the prototype problem (2) with true solution $u(x,y) = e^{2x+y}\cos(\pi(x + 4y + 0.25))$. Figure 1 plots, for both methods, the accuracy achieved against the execution time required to achieve it. When low accuracy is required, the finite-difference multigrid method is much faster than the line relaxation spectral multigrid method, but the situation is reversed when high accuracy is required. The crossover point for this problem is at an accuracy of about one percent error. The same conclusion would hold for finite-difference methods of higher (fixed) orders, although the crossover point would shift. Variable-order finite-difference methods could be expected to perform more like the spectral method, at the cost of considerable complexity.
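Reproducing this comparison requires the right-hand side $f = -\Delta u$ for the quoted true solution, with $g = u$ on the boundary. Differentiating by hand gives $f = -e^{2x+y}\left[(5 - 17\pi^2)\cos\theta - 12\pi\sin\theta\right]$ with $\theta = \pi(x + 4y + 0.25)$; this is our own derivation and is not stated in the text. The sketch below encodes it and checks it against a centred five-point finite-difference Laplacian. The function names are hypothetical.

```python
import math


def exact_u(x, y):
    """True solution of the comparison test: e^(2x+y) cos(pi (x + 4y + 0.25))."""
    return math.exp(2 * x + y) * math.cos(math.pi * (x + 4 * y + 0.25))


def rhs_f(x, y):
    """f = -Laplacian(u), differentiated by hand."""
    theta = math.pi * (x + 4 * y + 0.25)
    return -math.exp(2 * x + y) * ((5 - 17 * math.pi ** 2) * math.cos(theta)
                                   - 12 * math.pi * math.sin(theta))


def fd_laplacian(func, x, y, h=1e-3):
    """Centred five-point Laplacian, O(h^2) accurate; used here only as a check."""
    return (func(x + h, y) + func(x - h, y) + func(x, y + h) + func(x, y - h)
            - 4 * func(x, y)) / h ** 2


for (px, py) in [(0.0, 0.0), (0.1, -0.3), (-0.7, 0.5)]:
    print(px, py, rhs_f(px, py), -fd_laplacian(exact_u, px, py))
```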


Matrix Diagonalization Technique 

The matrix diagonalization technique is introduced in (ref. 5) as a direct solver for the Chebyshev spectral approximation to the Poisson equation with Dirichlet boundary conditions. This technique requires a preprocessing step, which involves computing the eigenvalues and eigenvectors of a one-dimensional operator matrix ($O(N^3)$ operations). It also requires a solution step, which involves multiplications by one-dimensional operator matrices ($O(N^3)$ operations).
To compare execution times, we note that the line relaxation spectral multigrid method usually takes approximately 10 V-cycles to solve to the level of machine precision. Thus, Figure 2 compares the execution time of 10 V-cycles of the line relaxation spectral multigrid method with the execution time of solving the same problem directly by the matrix diagonalization method (including the preprocessing step). These results show that the matrix diagonalization method is quite fast for small grid sizes, but as the grid size grows, it becomes slower than the line relaxation spectral multigrid method. This is because the line relaxation spectral multigrid method is an $O(N^2 \log N)$ method, while the matrix diagonalization method requires $O(N^3)$ operations (even without the preprocessing step).

The matrix diagonalization technique is very efficient for problems with constant coefficients, especially when repeated solutions are required. However, this technique can only handle problems with constant coefficients. As shown below, the line relaxation spectral multigrid method is able to solve problems with non-constant coefficients.
Figure 2. Execution time: LR-SMG vs MD


Extension to Problems With Variable Coefficients 


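As a test problem with variable coefficients we consider

$$-\frac{\partial}{\partial x}\left(a(x,y)\frac{\partial u}{\partial x}\right) - \frac{\partial}{\partial y}\left(b(x,y)\frac{\partial u}{\partial y}\right) = f(x,y), \quad |x|, |y| < 1; \qquad u(x,y) = g(x,y), \quad |x| = 1 \text{ or } |y| = 1, \qquad (24)$$

where the coefficient functions and the true solution are

$$a(x,y) = b(x,y) = 1 + \varepsilon\, e^{\cos(\beta(x+y))}, \qquad (25)$$

$$u(x,y) = \sin\left(\alpha\pi x + \frac{\pi}{4}\right)\sin\left(\alpha\pi y + \frac{\pi}{4}\right). \qquad (26)$$

The parameter $\varepsilon$ measures how far the coefficients are from the constant 1, $\beta$ measures the oscillation of the coefficients, and $\alpha$ measures the oscillation of the solution.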
Implementation of the Line Relaxation Spectral Multigrid Method

The implementation of the finite-difference version of the line relaxation method is basically the 
same as for the constant coefficient case except for the following: 


1. On each grid line, the one-dimensional problem is no longer a Helmholtz equation. For example, on a grid line $y = y_k$ parallel to the x-axis, we now solve a problem like

$$-\frac{\partial}{\partial x}\left(a(x,y_k)\frac{\partial v}{\partial x}\right) + \mathcal{V}_d(x,y_k)\,v(x,y_k) = h(x,y_k) \qquad (27)$$

by using a second-order finite-difference approximation on the Chebyshev grid.

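2. To compute the values of $\mathcal{V}_d(x_j, y_k)$, note that the interior equation in (24) can be rewritten as

$$-a\frac{\partial^2 u}{\partial x^2} - \frac{\partial a}{\partial x}\frac{\partial u}{\partial x} - b\frac{\partial^2 u}{\partial y^2} - \frac{\partial b}{\partial y}\frac{\partial u}{\partial y} = f, \qquad |x|, |y| < 1, \qquad (28)$$

and the Chebyshev collocation approximation to (28) can be written as

$$(-A\mathcal{D}_{xx} - A_x\mathcal{D}_x)U + (-B\mathcal{D}_{yy} - B_y\mathcal{D}_y)U = F. \qquad (29)$$

Here $A$ and $B$ are diagonal matrices containing the values of the coefficients $a(x_j, y_k)$ and $b(x_j, y_k)$. $A_x$ and $B_y$ are diagonal matrices containing the values of the derivatives $\frac{\partial a}{\partial x}(x_j, y_k)$ and $\frac{\partial b}{\partial y}(x_j, y_k)$, which can be computed from the values $a(x_j, y_k)$ and $b(x_j, y_k)$. Finally, $\mathcal{D}_x$, $\mathcal{D}_{xx}$, $\mathcal{D}_y$, and $\mathcal{D}_{yy}$ are Chebyshev differentiation matrices. Therefore, $\mathcal{H} = -A\mathcal{D}_{xx} - A_x\mathcal{D}_x$ and $\mathcal{V} = -B\mathcal{D}_{yy} - B_y\mathcal{D}_y$, so generating the diagonal entries of $\mathcal{H}$ and $\mathcal{V}$ is straightforward.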


3. On coarse grids, we need to use so-called "filtered" coefficients $a(x,y)$ and $b(x,y)$ to formulate the coarse-grid problems, i.e., the coefficients $a(x,y)$ and $b(x,y)$ are evaluated on the finest grid and then transferred to the coarser grids by Chebyshev restriction (ref. 3).


Computational Results 

We have run the line relaxation spectral multigrid method for different values of the parameters $\varepsilon$, $\alpha$, and $\beta$. For $\alpha = 1.0$ and $N_x = N_y = 32$, the smoothing factor is graphed in Figure 3 as a function of $\varepsilon$ and $\beta$. Here we have chosen to measure the smoothing by the "smoothing factor per work unit", defined by $\mu_w = (r_2/r_1)^{\tau_0/\tau}$. In this definition, $r_1$ and $r_2$ are the residual norms before and after one multigrid V-cycle, $\tau$ is the execution time of one cycle, and $\tau_0$ is the execution time of one relaxation. These results show that for a wide range of $\varepsilon$ and $\beta$, the method converges relatively quickly.
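The two normalizations used in this paper are one-line formulas. The per-sweep factor $\mu_s$ in the tables spreads a cycle's reduction over its $n_1 + n_2$ sweeps. The per-work-unit factor $\mu_w$ defined above spreads it over the cycle's cost measured in relaxation times, which makes cycles of different cost comparable. A minimal sketch follows; the function names are ours, and the numbers in the example are made up for illustration.

```python
def per_sweep(rho, n1, n2):
    """Smoothing factor per sweep: rho(M)^(1/(n1+n2)) for a V(n1,n2)-cycle."""
    return rho ** (1.0 / (n1 + n2))


def per_work_unit(r1, r2, tau0, tau):
    """mu_w = (r2/r1)^(tau0/tau): residual reduction per unit of relaxation time,
    where one V-cycle takes time tau and one relaxation takes time tau0."""
    return (r2 / r1) ** (tau0 / tau)


# Illustrative numbers: a V(2,1)-cycle that reduces the residual by 1/8 and costs
# as much as 3 relaxations gives about 0.5 per sweep and about 0.5 per work unit.
print(per_sweep(0.125, 2, 1), per_work_unit(1.0, 0.125, 1.0, 3.0))
```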

In (ref. 3) the same test problem (24) was solved using the Richardson relaxation (5) using 
two-dimensional finite-difference preconditioning; incomplete LU decomposition was used to 
approximately solve the finite difference approximation on the Chebyshev grid. With only limited 
details of the formulation and results of this method, it is difficult to make a complete comparison 
to the line relaxation method considered here. However, it appears that the line relaxation method 
gives convergence factors at least as small as those in (ref. 3); moreover, it is simpler. 
Figure 3. Smoothing factors for problems with variable coefficients.


CONCLUSIONS 


The pointwise preconditioning is simple and fast to apply. It is very efficient for one-dimensional 
problems. Unfortunately, it does not give fast multigrid convergence for two-dimensional problems. 

The line relaxation method provides a new approach to accelerating the multigrid Chebyshev spectral method for solving two-dimensional elliptic problems. It is efficient (yielding multigrid smoothing factors no larger than 0.5 per sweep) and inexpensive (requiring $O(N^2 \log N)$ operations per sweep).

When high accuracy is required, the spectral multigrid method using line relaxation is orders 
of magnitude faster than a conventional finite-difference multigrid method, due primarily to the 
exponential convergence of the spectral discretization. Compared to other methods for solving 
the discrete spectral equations, the line relaxation method also has advantages: it is comparable 
in efficiency to matrix diagonalization and finite-difference preconditioned Richardson relaxation, 
but can solve problems with variable coefficients which the former cannot, and is simpler than the 
latter. 
SUMMARY

We show that, by certain transformations, the boundary layer equations for the class of non-Newtonian fluids called pseudoplastic can be generalized to the form

$$\Delta u + p(x)u^{-\lambda} = 0, \qquad x \in \Omega \subset \mathbb{R}^n,\ n \ge 1,$$

under the classical conditions for steady flow over a semi-infinite flat plate. We provide a survey of the existence, uniqueness, and analyticity of the solutions for this problem. We also compute numerical solutions in one- and two-dimensional regions using multigrid methods.

INTRODUCTION 


In the last two decades, solutions of the singular semilinear equation

$$\Delta u + p(x)u^{-\lambda} = 0, \qquad x \in \Omega \subset \mathbb{R}^n \qquad (1)$$

have been extensively studied. Various existence and uniqueness results are given in [1], [2], and [3], to name a few. More recently, [4] showed that, by certain transformations, the boundary layer equations for the class of non-Newtonian fluids called pseudoplastic can be generalized to the above form in the ODE case $n = 1$. Under this physical interpretation, the above equation, considered in the context of partial differential equations ($n > 1$), has been the subject of much study. The equation has a unique classical solution in a bounded domain $\Omega$ when $p(x)$ is a sufficiently regular function that is positive on D [5]. Entire solutions exist for $\lambda \in (0, 1)$ when $p(x)$ is sufficiently regular ([6], [7]). This result has been generalized to all $\lambda > 0$ via the upper and lower solution method ([8]) or other methods ([9]).

*This work was supported in part by the Naval Postgraduate School Research Council under grant No. ZZ867-ZZ899/5986.
The following sections provide a survey of both theoretical and numerical results in this area
including a physical derivation [4], existence theorems for both the ODE and PDE cases with a 
proof of a main result [8], and our numerical results. We conclude with a discussion of a new 
technique and some open questions for further research. 

PRELIMINARIES 


A non-Newtonian fluid is called pseudoplastic if the shear stress $\tau$ and the strain rate $\partial u/\partial y$ are related as

$$|\tau| = k\left|\frac{\partial u}{\partial y}\right|^{\alpha}, \qquad 0 < \alpha < 1,$$

where $k$ is a positive constant. That is, the absolute value of the shear stress increases less than linearly with the absolute value of the strain rate.

In this paper, we study solutions of the singular semilinear equation (1), where $\lambda > 0$ and $\Omega$ is a domain in $\mathbb{R}^n$, $n \ge 1$. In the following section we show the following: through a series of transformations, the boundary layer equations for the class of pseudoplastic fluids, under the classical conditions for a steady flow over a semi-infinite flat plate, generalize the well-known Blasius problem

$$f''' + f f'' = 0, \qquad f(0) = f'(0) = 0, \quad f'(\infty) = 1$$

for the shear function, which arises in the standard Newtonian fluid case.


DERIVATION OF THE PROBLEM 


For n = 1, equation (1) arises in the study of pseudoplastic fluids. We consider a two-dimensional incompressible flow of low viscosity along a plane wall. We denote by $\mathbf{v} = (u, v)$ the fluid velocity in the boundary layer and by $u_\infty(x)$ the velocity of the main stream. Since the velocity vanishes on the wall and the fluid takes the main-stream velocity $u_\infty(x)$ outside the boundary layer, $\partial u/\partial y$ is large near the wall, which causes a significant transfer of momentum in the x direction.

The boundary layer equations for this model consist of a continuity equation and a momentum equation in the x direction,

$$\frac{\partial u}{\partial x} + \frac{\partial v}{\partial y} = 0, \qquad u\frac{\partial u}{\partial x} + v\frac{\partial u}{\partial y} = \frac{1}{\rho}\frac{\partial \tau_{xy}}{\partial y}, \tag{2}$$

with boundary conditions

$$u(x,0) = v(x,0) = 0, \qquad u(x,\infty) = u_\infty(x),$$


Figure 1: 2-D flow of low viscosity along a plane wall. (Schematic: velocity $(u(x,y), v(x,y))$ in the boundary layer; velocity $u_\infty(x)$ in the main stream.)


where $\tau_{xy} = K \left( \dfrac{\partial u}{\partial y} \right)^{\alpha}$ is the shear stress.


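Note that (2) consists of 2 coupled equations in 3 dependent variables, $u$, $v$, and $\tau_{xy}$. To reduce it to a single higher-order equation in only 1 dependent variable, we introduce the Lagrange stream function $\Phi(x,y)$ such that

$$u = \frac{\partial \Phi}{\partial y}, \qquad v = -\frac{\partial \Phi}{\partial x}.$$

Then the momentum equation becomes

$$\frac{\partial \Phi}{\partial y}\frac{\partial^2 \Phi}{\partial x\,\partial y} - \frac{\partial \Phi}{\partial x}\frac{\partial^2 \Phi}{\partial y^2} = \nu \frac{\partial}{\partial y}\left(\frac{\partial^2 \Phi}{\partial y^2}\right)^{\alpha}, \tag{3}$$

where $\nu = K/\rho$, while the continuity equation is automatically satisfied by $\Phi$.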

Let $f = a\Phi/\sqrt{u_\infty \nu x}$ and $\eta = b\,y\sqrt{u_\infty/(\nu x)}$ for some constants $a$, $b$. Then (3) becomes

$$f''' + f\,(f'')^{2-\alpha} = 0, \tag{4}$$


with

$$f(0) = f'(0) = 0, \qquad f'(\infty) = 1,$$

where $f = f(\eta)$. (Observe that if $\alpha = 1$, (4) is the well-known Blasius equation.) Employing the Crocco-like transformation

$$u = f'(\eta), \qquad g(u) = u' = f''(\eta),$$

(4) becomes

$$g^{\alpha} g'' + (\alpha - 1)\,g^{\alpha-1}(g')^2 + u = 0$$

with $g'(0) = 0$, $g(1) = 0$, where $g = g(u)$. Finally, the transformation $G = g^{\alpha}$ leads to the singular boundary value problem

$$G'' + \alpha\,u\,G^{-1/\alpha} = 0, \qquad 0 < u < 1, \quad 0 < \alpha < 1,$$
$$G'(0) = G(1) = 0,$$

which is of the form (1) with $\lambda = 1/\alpha$, independent variable $x = u$, and $p(x) = x/\lambda$.
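The text states the outcome of the Crocco-like transformation without the intermediate steps. The worked derivation below assumes $g(u) > 0$ on $[0, 1)$, so that the powers and divisions are legitimate. Primes on $f$ denote $d/d\eta$ and primes on $g$, $G$ denote $d/du$.

```latex
% Crocco step: u = f'(\eta), g(u) = f''(\eta), so d/d\eta = g(u)\, d/du.
\begin{gathered}
f''' = \frac{d}{d\eta} g(u) = g\,g'
\quad\Longrightarrow\quad
f''' + f\,(f'')^{2-\alpha} = g\,g' + f\,g^{2-\alpha} = 0
\quad\Longrightarrow\quad
g^{\alpha-1} g' + f = 0, \\
\frac{d}{d\eta}\bigl(g^{\alpha-1} g'\bigr) + f'
= g\bigl[(\alpha-1) g^{\alpha-2} (g')^2 + g^{\alpha-1} g''\bigr] + u = 0
\quad\Longrightarrow\quad
g^{\alpha} g'' + (\alpha-1)\, g^{\alpha-1} (g')^2 + u = 0, \\
G = g^{\alpha}:\quad
G'' = \alpha g^{\alpha-1} g'' + \alpha(\alpha-1) g^{\alpha-2} (g')^2
\quad\Longrightarrow\quad
\frac{g}{\alpha} G'' + u = 0
\quad\Longrightarrow\quad
G'' + \alpha\, u\, G^{-1/\alpha} = 0.
\end{gathered}
% Boundary values: at the wall (eta = 0, u = 0) f = 0, so g^{alpha-1} g' + f = 0 gives g'(0) = 0;
% as eta -> infinity, u -> 1 and f'' -> 0, so g(1) = 0. Hence G'(0) = G(1) = 0.
```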

EXISTENCE AND UNIQUENESS RESULTS 

In the first part of this section we review results on finite and infinite domains; in the second we discuss methods that are commonly used to approach the problem.


Theorems 


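Let $\Omega$ be a bounded domain in $\mathbb{R}^n$, $n \ge 1$, with smooth boundary $\partial\Omega$ (of class $C^{2+\alpha}$, $0 < \alpha < 1$). Let $p(x) \in C^{\alpha}(\bar\Omega)$ be positive on $\bar\Omega$, and let $\lambda > 0$.

Theorem 1 (Lazer-McKenna [5]). The problem

$$\Delta u + p(x)\,u^{-\lambda} = 0, \quad x \in \Omega, \qquad u|_{\partial\Omega} = 0$$

has a unique positive solution $u(x)$ in $\Omega$ with $u \in C^{2+\alpha}(\Omega) \cap C(\bar\Omega)$. Furthermore, let $\phi_1$ be an eigenfunction corresponding to the smallest eigenvalue $\lambda_1$ of the problem

$$\Delta \phi + \lambda_1 \phi = 0, \quad x \in \Omega, \qquad \phi|_{\partial\Omega} = 0,$$

such that $\phi_1(x) > 0$ for $x \in \Omega$, and suppose $\lambda > 1$. Then there exist constants $b_1, b_2 > 0$ such that

$$b_1\,\phi_1^{2/(1+\lambda)} \le u \le b_2\,\phi_1^{2/(1+\lambda)}$$

on $\Omega$.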

In the case $\Omega = \mathbb{R}^n$, $n \ge 1$, we review the results for $n = 1$, $n = 2$, and $n \ge 3$. Observe that if $n = 1$, since $p, y > 0$ and $y'' + p\,y^{-\lambda} = 0$, we have $y'' < 0$, and thus $y'$ is decreasing. Hence $0 \le y'(\infty) < \infty$.

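Theorem 2 (Taliaferro [3]). The problem

$$y'' + p(x)\,y^{-\lambda} = 0, \qquad y(c) = a, \qquad y'(\infty) = 0,$$

where $a, c \in \mathbb{R}$, $a > 0$, has a unique positive solution $y(x)$ if

$$\int_c^{\infty} x^{-\lambda} p(x)\,dx < \infty.$$

Furthermore, $y(\infty) < \infty$ if and only if $\int_0^{\infty} x\,p(x)\,dx < \infty$.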
The following theorem describes the asymptotic behavior of the solution.


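Theorem 3 (Taliaferro [3]).

• If $0 < y'(\infty) < \infty$ and $\int^{\infty} x^{-\lambda+1} p(x)\,dx < \infty$, then, for some constants $a, b > 0$,

$$y(x) = ax + b - a^{-\lambda}(1 + o(1)) \int_x^{\infty} (\xi - x)\,\xi^{-\lambda} p(\xi)\,d\xi.$$

• If $y'(\infty) = 0$ and $\int_0^{\infty} x\,p(x)\,dx < \infty$, then, for some constant $a > 0$,

$$y(x) = a - a^{-\lambda}(1 + o(1)) \int_x^{\infty} (\xi - x)\,p(\xi)\,d\xi.$$

• If $p, q > 0$ are continuous on $[0, \infty)$, $\lim_{x\to\infty} q(x)/p(x) = R > 0$,

$$z'' + p(x)\,z^{-\lambda} = 0, \quad z'(\infty) = 0; \qquad w'' + q(x)\,w^{-\lambda} = 0, \quad w'(\infty) = 0,$$

and $\int_0^{\infty} x\,p(x)\,dx = \infty$, then $\lim_{x\to\infty} w/z = R^{1/(1+\lambda)}$.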

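Theorem 4 (Kusano-Swanson [7]). The problem

$$\Delta u + f(|x|)\,u^{-\lambda} = 0, \qquad x \in \mathbb{R}^2, \quad 0 < \lambda < 1,$$

has an entire positive solution in $\mathbb{R}^2$ with logarithmic growth at $\infty$ if $f(t) > 0$ for $t > 0$, $f \in C(0, \infty)$, and

$$\int^{\infty} t\,(\log t)^{-\lambda} f(t)\,dt < \infty.$$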


A function $u(x)$ is said to be an entire solution of (1) if $u \in C^2_{\mathrm{loc}}(\mathbb{R}^n)$ and $u$ satisfies the equation pointwise in $\mathbb{R}^n$.

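Theorem 5 (Shaker [8]). The problem

$$\Delta u + p(x)\,u^{-\lambda} = 0, \qquad x \in \mathbb{R}^n, \quad \lambda > 0,$$

has an entire positive solution $u(x)$ such that

$$c_1 \le u(x)\,|x|^{q(n-2)} \le c_2 \quad \text{as } |x| \to \infty$$

for some constants $c_1$, $c_2$ and $0 < q < 1$, if

1. $p \in C^{\alpha}_{\mathrm{loc}}(\mathbb{R}^n)$ and $p(x) > 0$ for $x \in \mathbb{R}^n \setminus \{0\}$;

2. there exists $0 < c < 1$ such that $c\,\phi(|x|) \le p(x) \le \phi(|x|)$, where $\phi(t) = \max_{|x| = t} p(x)$, $t \in [0, \infty)$;

3. $\int^{\infty} t^{\,n-1+\lambda(n-2)}\,\phi(t)\,dt < \infty$.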
Methods


In general, two methods are commonly used to prove existence and uniqueness of solutions for equations of type (1), namely Schauder's fixed point theorem and barrier methods. Since the former is standard, we elaborate here only on the latter.

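Let $\Omega$ be a smoothly bounded domain in $\mathbb{R}^n$. A function $\phi(x)$ is said to be an upper (lower) solution of the problem

$$\Delta u + f(x, u) = 0, \quad x \in \Omega, \qquad u|_{\partial\Omega} = 0, \tag{5}$$

if $\Delta\phi + f(x, \phi) \le 0$ for $x \in \Omega$ and $\phi(x) \ge 0$ for $x \in \partial\Omega$ (respectively, $\Delta\phi + f(x, \phi) \ge 0$ for $x \in \Omega$ and $\phi(x) \le 0$ for $x \in \partial\Omega$).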

Theorem 6 (Sattinger [10]). Let $\phi_1$ be an upper solution and $\phi_2$ a lower solution of (5), and let $f$ be locally Hölder continuous in $\Omega$. If $\phi_1(x) \ge \phi_2(x)$ in $\bar\Omega$, then (5) has a solution $u$ such that $\phi_2(x) \le u(x) \le \phi_1(x)$, $x \in \Omega$.

In the case $\Omega = \mathbb{R}^n$ we say that $\phi$ is an upper (lower) solution of

$$\Delta u + f(x, u) = 0 \tag{6}$$

if $\Delta\phi + f(x, \phi) \le 0$ for $x \in \mathbb{R}^n$ (for a lower solution, $\Delta\phi + f(x, \phi) \ge 0$).


Theorem 7 (Ni [11]). Let $\phi_1$ and $\phi_2$ be an upper and a lower solution of equation (6) such that $\phi_1(x) \ge \phi_2(x)$, $x \in \mathbb{R}^n$. If $f$ is locally Hölder continuous in $x$ and locally Lipschitz continuous in $u$, then (6) has a solution $u$ with $\phi_2(x) \le u(x) \le \phi_1(x)$, $x \in \mathbb{R}^n$.

An Example.

Consider the problem

$$u'' + \lambda u - u^3 = 0, \quad x \in (0, \pi), \qquad u = 0, \quad x = 0, \pi.$$

It is easy to show that $\phi_1(x) = R\,x^{1/2}$ for some large $R$ is an upper solution, and, provided $\lambda > 1$, $\phi_2(x) = \varepsilon \sin x$ for some small $\varepsilon > 0$ is a lower solution of this problem. Clearly $\phi_1(x) \ge \phi_2(x)$ for $x \in [0, \pi]$. Thus, by the above theorem, there is a solution $u(x)$ such that $\varepsilon \sin x \le u(x) \le R\,x^{1/2}$, $x \in [0, \pi]$. Since the nonlinearity is odd in $u$ (if $u$ is a solution, so is $-u$), we conclude that the problem has at least three solutions, namely $u$, $-u$, and the trivial solution.
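Constructive proofs of results like Theorem 6 build the solution by monotone iteration between the barriers. The sketch below discretizes the example on n = 64 intervals and iterates $-u_{k+1}'' + K u_{k+1} = f(u_k) + K u_k$ with $f(u) = \lambda u - u^3$, starting from an upper solution. The shift K makes f(u) + Ku nondecreasing. λ = 4, n and the tolerance are illustrative assumptions. For simplicity the sketch starts from the constant upper solution $R = \sqrt{\lambda}$, for which $\lambda R - R^3 = 0$. This choice is our own simplification, not the one in the text.

```python
import math

def thomas(a, b, c, d):
    """Solve a tridiagonal system (sub-diagonal a, diagonal b, super-diagonal c)."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def monotone_iteration(lam=4.0, n=64, tol=1e-12, max_iter=5000):
    """Monotone (Sattinger-type) iteration for u'' + lam*u - u**3 = 0, u(0) = u(pi) = 0."""
    h = math.pi / n
    R = math.sqrt(lam)              # constant upper solution: lam*R - R**3 = 0
    K = 3.0 * R**2 - lam            # f'(u) + K >= 0 on [0, R]
    m = n - 1                       # number of interior unknowns
    a = [-1.0 / h**2] * m
    b = [2.0 / h**2 + K] * m
    c = [-1.0 / h**2] * m
    u = [R] * m
    for _ in range(max_iter):
        rhs = [lam * v - v**3 + K * v for v in u]
        new = thomas(a, b, c, rhs)  # (-D_h + K) u_new = f(u) + K u
        change = max(abs(x - y) for x, y in zip(new, u))
        u = new
        if change < tol:
            break
    return h, u

h, u = monotone_iteration()
print(max(u))
```

Because (-D_h + K) is an M-matrix and f(u) + Ku is nondecreasing on [0, R], the iterates decrease from R. They stay above the discrete lower solution ε sin x, so the limit is a nontrivial solution between the two barriers.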


MULTIGRID SOLUTION OF THE PROBLEM 

In this section we present some numerical results for solving the problem

$$\Delta u + p(x)\,u^{-\lambda} = 0, \quad x \in \Omega, \qquad u(x) = 0, \quad x \in \partial\Omega.$$
Specifically, we describe Newton's method for nonlinear systems, together with multigrid V-cycle and FMV (full multigrid V-cycle) methods. We have implemented all of these methods for both the one- and two-dimensional cases, using (respectively) the unit interval and the unit square for $\Omega$. In each case we use a straightforward finite-difference discretization, employing the standard second-order difference approximation for the second-derivative operator. For the one-dimensional problem we define the grid of $N + 1$ points $x_k = kh$, $k = 0, 1, \ldots, N$, where $h = 1/N$ is the mesh parameter. The second-derivative operator is then approximated by


$$\left.\frac{d^2 u}{dx^2}\right|_{x_k} \approx \frac{u_{k-1} - 2u_k + u_{k+1}}{h^2}, \tag{7}$$


where $u_k$ approximates $u(x_k)$. For the nonlinear term $p(x)\,u(x)^{-\lambda}$ we use the nodal values $p_k u_k^{-\lambda}$. Since $u_0 = u_N = 0$, this results in the nonlinear system of equations

$$\frac{u_{k-1} - 2u_k + u_{k+1}}{h^2} + p_k u_k^{-\lambda} = 0, \qquad k = 1, 2, \ldots, N - 1. \tag{8}$$

Letting $u$ represent the vector of unknowns, we may write the system as $Hu + g(u) = 0$, where $H$ is the tridiagonal matrix and $g$ is the nonlinear vector function.
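A short sketch of the 1D discretization. The helper names are hypothetical. It evaluates the residual of the nodal system (equation k: second difference (7) plus $p_k u_k^{-\lambda}$), and it checks the second-order accuracy of (7) on the smooth test function $\sin(\pi x)$, which is our own choice.

```python
import math

def second_difference(u, h):
    """(7): (u[k-1] - 2 u[k] + u[k+1]) / h**2 at the interior nodes k = 1..N-1."""
    return [(u[k - 1] - 2.0 * u[k] + u[k + 1]) / h**2 for k in range(1, len(u) - 1)]

def residual_1d(u, p, lam, h):
    """Residual of the nodal system: second difference plus p_k * u_k**(-lam)."""
    d2 = second_difference(u, h)
    return [d2[k - 1] + p[k] * u[k] ** (-lam) for k in range(1, len(u) - 1)]

def max_truncation_error(n):
    """Max error of (7) for u = sin(pi x), whose exact second derivative is -pi^2 sin(pi x)."""
    h = 1.0 / n
    u = [math.sin(math.pi * k * h) for k in range(n + 1)]
    d2 = second_difference(u, h)
    return max(abs(d2[k - 1] + math.pi**2 * u[k]) for k in range(1, n))

for n in (16, 32, 64):
    print(n, max_truncation_error(n))
```

Halving h should reduce the error by a factor close to 4, as expected for a second-order approximation.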

For the two-dimensional case we take the tensor product of the $(N + 1)$-point grid in the x direction with an identical $(N + 1)$-point grid in the y direction, yielding an $(N + 1)^2$-point regular grid covering the unit square. The difference operator for the two-dimensional problem is

$$\left.\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right)\right|_{(x_j, y_k)} \approx \frac{u_{j-1,k} - 2u_{j,k} + u_{j+1,k}}{h^2} + \frac{u_{j,k-1} - 2u_{j,k} + u_{j,k+1}}{h^2}. \tag{9}$$


Numbering the unknowns lexicographically, one grid line $x = x_j$ at a time, we obtain the nonlinear system

$$B U_{j-1} + A U_j + B U_{j+1} + w_j = 0, \qquad j = 1, 2, \ldots, N - 1, \quad U_0 = U_N = 0, \tag{10}$$

where $U_j$ denotes the $(N - 1)$-length vector of unknowns $u_{j,k}$, $k = 1, 2, \ldots, N - 1$, corresponding to the grid line in the y direction, and $A$ and $B$ are the $(N - 1) \times (N - 1)$ matrices

$$A = \frac{1}{h^2}\,\mathrm{tridiag}(1, -4, 1), \qquad B = \frac{1}{h^2}\,I.$$

The $(N - 1)$-length vectors $w_j$ contain the nonlinear entries $p_{j,k} u_{j,k}^{-\lambda}$, $k = 1, 2, \ldots, N - 1$. Once again, we may write the system as $Hu + g(u) = 0$, where $H$ is the block tridiagonal matrix and $g$ is the nonlinear vector function containing the $w_j$'s.
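A minimal sketch of the 2D residual with lexicographic numbering. The function names are hypothetical. Unknown $u_{j,k}$ is stored at position $(j-1)(N-1) + (k-1)$, so the entries for one grid line $x = x_j$ are contiguous and form the block vector $U_j$. Boundary values are zero, as in the text.

```python
def index(j, k, n):
    """Position of interior unknown u_{j,k} (1 <= j, k <= n-1) in the flat vector;
    all k on grid line j are contiguous, giving the block vector U_j."""
    return (j - 1) * (n - 1) + (k - 1)

def residual_2d(u, p, lam, n):
    """5-point stencil (9) plus p(x_j, y_k) * u_{j,k}**(-lam) at every interior node.
    u is a flat list of (n-1)**2 positive interior values; p is a callable p(x, y)."""
    h = 1.0 / n

    def val(j, k):
        if j in (0, n) or k in (0, n):      # homogeneous Dirichlet boundary
            return 0.0
        return u[index(j, k, n)]

    r = [0.0] * len(u)
    for j in range(1, n):
        for k in range(1, n):
            c = val(j, k)
            lap = (val(j - 1, k) + val(j + 1, k) + val(j, k - 1) + val(j, k + 1) - 4.0 * c) / h**2
            r[index(j, k, n)] = lap + p(j * h, k * h) * c ** (-lam)
    return r
```

Passing p(x, y) = 0 isolates the stencil. It reproduces the Laplacian of x(1-x)y(1-y), namely -2[x(1-x) + y(1-y)], exactly up to rounding, because second differences are exact for quadratics.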


Solution techniques 


The classical solution technique for (8) or (10) is to apply Newton's method for nonlinear systems. We write the system as $F(u) = 0$, where $F(u) = Hu + g(u)$. Each step of the iteration is then given by

$$u \leftarrow u - [J_F(u)]^{-1} F(u),$$

where the Jacobian of the system is given by

$$J_F(u) = H + D,$$

with $H$ the linear part of $F$ and $D$ a diagonal matrix whose diagonal entries are the derivatives of the entries of $g$, for example $-\lambda\,p(x_{j,k})\,u_{j,k}^{-\lambda-1}$.

Naturally, the Jacobian is not inverted at each step; rather, we solve the system $J_F(u)\,y = -F(u)$ and then make the correction $u \leftarrow u + y$. We examined two methods for solving the system at each step, namely LU decomposition and a multigrid FMV cycle.
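A sketch of Newton's method for the 1D system (8), using a direct tridiagonal (Thomas, i.e. LU without pivoting) solve for each correction. λ = 1, p ≡ 1, N = 64, the tolerance and the initial guess 0.1 sin(πx) are illustrative assumptions. For these choices the initial guess is a positive subsolution. Because the nonlinear term is convex in u, the Newton iterates should then increase monotonically and stay positive, which keeps $u^{-\lambda}$ well defined.

```python
import math

def solve_tridiag(lower, diag, upper, rhs):
    """Thomas algorithm (tridiagonal LU, no pivoting); the Jacobian here is
    diagonally dominant, so no pivoting is needed."""
    n = len(rhs)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        m = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / m if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / m
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def newton_1d(n=64, lam=1.0, p=lambda x: 1.0, tol=1e-9, max_iter=50):
    """Newton's method for F(u) = H u + g(u) = 0 with g_k = p_k * u_k**(-lam)."""
    h = 1.0 / n
    x = [k * h for k in range(1, n)]                  # interior nodes only
    pk = [p(xk) for xk in x]
    m = n - 1
    u = [0.1 * math.sin(math.pi * xk) for xk in x]    # assumed positive initial guess
    history = []
    for _ in range(max_iter):
        ext = [0.0] + u + [0.0]                       # boundary values u_0 = u_N = 0
        F = [(ext[k] - 2.0 * ext[k + 1] + ext[k + 2]) / h**2 + pk[k] * u[k] ** (-lam)
             for k in range(m)]
        history.append(max(abs(v) for v in F))
        if history[-1] < tol:
            break
        # Jacobian J = H + D with D_kk = -lam * p_k * u_k**(-lam - 1)
        diag = [-2.0 / h**2 - lam * pk[k] * u[k] ** (-lam - 1) for k in range(m)]
        off = [1.0 / h**2] * m
        y = solve_tridiag(off, diag, off, [-v for v in F])   # J y = -F(u)
        u = [uk + yk for uk, yk in zip(u, y)]                 # u <- u + y
    return u, history

u, history = newton_1d()
for it, r in enumerate(history):
    print(it, r)
```

The residual history should show the rapid drop of the final quadratic phase. Each step costs one linear solve, which is the expense the text refers to.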

Newton's method converges quadratically. However, since each step requires the solution of a linear system, it tends to be very slow in practice. While the use of the FMV solver speeds the method up somewhat, it is still slower than the techniques we present next. It has long been known ([12], [13]) that on certain problems nonlinear analogs of the classical Jacobi or Gauss-Seidel iteration methods can be employed with some success. Technically, one sweep of such a method means that for $j = 1, 2, \ldots, N - 1$ (or for each of the $(N - 1)^2$ unknowns of the two-dimensional problem) one solves, via the scalar Newton's method, the $j$th nonlinear equation in the system $F(u) = 0$ for the $j$th unknown. As in the linear case, if the old values of $u$ are used throughout the sweep, this is the Newton-Jacobi method, while if the updated values are employed as they become available, it is the Newton-Gauss-Seidel method. In practice the $j$th equation is not actually solved exactly; rather, a few (one or two) steps of the scalar Newton's method are performed on each equation in turn.

The Newton-Jacobi and Newton-Gauss-Seidel iterations, however, typically behave in the same fashion as their linear counterparts. That is, the iteration generally progresses rapidly toward a solution during the first few sweeps, but then stalls, so that each additional sweep produces very little improvement. The reason is the same as in the linear case: the nonlinear relaxation rapidly eliminates the oscillatory portion of the error, but it is unable to treat the smooth portion of the error effectively. This is precisely the difficulty that multigrid methods were devised to overcome.
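The fast-then-stalled behavior can be observed with a small sketch of red-black Newton-Gauss-Seidel on the 1D system. Each sweep takes one scalar Newton step on equation k for unknown $u_k$, first at odd nodes, then at even nodes. N = 64, λ = 1, p ≡ 1 and the initial guess are illustrative assumptions. One would expect the residual to fall quickly in the first sweeps and then decrease by a per-sweep factor close to 1.

```python
import math

def residual_norm(u, h, lam, p):
    """Max-norm residual of the 1D nodal equations; u includes the boundary zeros."""
    return max(abs((u[k-1] - 2.0*u[k] + u[k+1]) / h**2 + p * u[k] ** (-lam))
               for k in range(1, len(u) - 1))

def rb_newton_gauss_seidel(u, h, lam, p):
    """One red-black Newton-Gauss-Seidel sweep, updating u in place."""
    n = len(u) - 1
    for first in (1, 2):                    # odd ("red") nodes, then even ("black")
        for k in range(first, n, 2):
            phi = (u[k-1] - 2.0*u[k] + u[k+1]) / h**2 + p * u[k] ** (-lam)
            dphi = -2.0 / h**2 - lam * p * u[k] ** (-lam - 1)
            # With p > 0 and non-negative neighbours this step keeps u_k > 0.
            u[k] -= phi / dphi
    return u

n, lam, p = 64, 1.0, 1.0                    # illustrative choices (assumptions)
h = 1.0 / n
u = [0.1 * math.sin(math.pi * k * h) for k in range(n + 1)]
u[0] = u[n] = 0.0
history = [residual_norm(u, h, lam, p)]
for sweep in range(200):
    rb_newton_gauss_seidel(u, h, lam, p)
    history.append(residual_norm(u, h, lam, p))
for s in (1, 5, 50, 200):
    print(s, history[s], history[s] / history[s - 1])
```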

At the heart of multigrid is the coarse-grid correction [14]. Many common iterative relaxation methods for solving a linear problem $Au = f$ have the property that relaxation effectively eliminates the high-frequency (oscillatory) components of the error but leaves the low-frequency (smooth) components essentially unaffected. However, because the error is smooth after relaxation, it may be represented accurately on a coarser grid, on which it also appears (relatively) more oscillatory. Relaxation on this coarser grid then eliminates the oscillatory components of the coarse-grid error, which cannot be eliminated efficiently on the fine grid. The coarse-grid correction for a linear problem may be written as

$$u^h \leftarrow P^{\nu} u^h + I_{2h}^{h} \left(A^{2h}\right)^{-1} I_h^{2h} \left(f^h - A^h P^{\nu} u^h\right), \tag{11}$$

where $P$ is the relaxation matrix, $\nu$ is the number of relaxations, $I_{2h}^{h}$ is a prolongation or interpolation matrix mapping coarse-grid vectors to the fine grid, $I_h^{2h}$ is a restriction matrix mapping fine-grid vectors to the coarse grid, and $A^{2h}$ is a coarse-grid version of the original matrix $A$. A crucial feature is that on the coarse grid $\Omega^{2h}$ the problem to be solved is the residual equation $Ae = r$, where the residual is defined as $r = f - Au$ and $e$ is the error. That is, if $u^*$ is the exact solution, then $Ae = A(u^* - u) = f - Au = r$.

For nonlinear problems the residual equation does not hold. Instead, we write the nonlinear equivalent of the residual equation,

$$F(u + e) - F(u) = r.$$

This equation is to be solved on the coarse grid, so we write

$$F^{2h}\left(I_h^{2h} u^h + e^{2h}\right) - F^{2h}\left(I_h^{2h} u^h\right) = I_h^{2h}\left(f^h - F^h(u^h)\right), \tag{12}$$

or

$$F^{2h}(u^{2h}) = I_h^{2h}\left(f^h - F^h(u^h)\right) + F^{2h}\left(I_h^{2h} u^h\right).$$

The coarse-grid correction is then performed by solving (12) for $u^{2h} = I_h^{2h} u^h + e^{2h}$ and then making the correction $u^h \leftarrow u^h + I_{2h}^h\left(u^{2h} - I_h^{2h} u^h\right)$. This gives the full approximation scheme (FAS) [15],

$$u^h \leftarrow P^{\nu}(u^h) + I_{2h}^h\Bigl( (F^{2h})^{-1}\bigl( I_h^{2h}\bigl(f^h - F^h(P^{\nu}(u^h))\bigr) + F^{2h}\bigl(I_h^{2h} P^{\nu}(u^h)\bigr) \bigr) - I_h^{2h} P^{\nu}(u^h) \Bigr),$$

where $P$ is a nonlinear relaxation scheme.

For both the linear and nonlinear problems, the coarse-grid problem is itself solved using the same coarse-grid correction scheme that is employed to solve the fine-grid problem. This leads to the multigrid V-cycle scheme, which (for the nonlinear problem using FAS) is described recursively as follows.

$u^h \leftarrow \mathrm{FASV}^h(u^h, f^h, \nu_1, \nu_2)$

1. Perform $\nu_1$ nonlinear relaxation sweeps on $F^h(u^h) = f^h$ with initial guess $u^h$.

2. If $\Omega^h$ is the coarsest grid, go to step 4. Else:

$$f^{2h} = I_h^{2h}\left(f^h - F^h(u^h)\right) + F^{2h}\left(I_h^{2h} u^h\right),$$
$$u^{2h} \leftarrow I_h^{2h} u^h,$$
$$u^{2h} \leftarrow \mathrm{FASV}^{2h}\left(u^{2h}, f^{2h}, \nu_1, \nu_2\right).$$

3. Correct $u^h \leftarrow u^h + I_{2h}^h\left(u^{2h} - I_h^{2h} u^h\right)$.

4. Perform $\nu_2$ nonlinear relaxation sweeps on $F^h(u^h) = f^h$ with initial guess $u^h$.
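The following is a compact 1D sketch of the FASV recursion, not the authors' MATLAB code. Several choices in it are our own assumptions:

- red-black Newton-Gauss-Seidel smoothing;
- full weighting for restricting residuals, and injection for $I_h^{2h} u^h$;
- linear interpolation;
- a coarse initial guess $I_h^{2h} u^h$;
- two positivity safeguards that keep $u^{-\lambda}$ defined;
- λ = 2, p ≡ 1, N = 64, ν1 = 2, ν2 = 1.

```python
import math

def apply_F(u, p, lam, h):
    """F^h(u)_k = (u[k-1] - 2u[k] + u[k+1]) / h^2 + p[k] * u[k]**(-lam); boundary entries 0."""
    n = len(u) - 1
    out = [0.0] * (n + 1)
    for k in range(1, n):
        out[k] = (u[k-1] - 2.0*u[k] + u[k+1]) / h**2 + p[k] * u[k] ** (-lam)
    return out

def relax(u, f, p, lam, h, sweeps):
    """Red-black Newton-Gauss-Seidel on F^h(u) = f: one scalar Newton step per node."""
    n = len(u) - 1
    for _ in range(sweeps):
        for first in (1, 2):                      # odd nodes, then even nodes
            for k in range(first, n, 2):
                phi = (u[k-1] - 2.0*u[k] + u[k+1]) / h**2 + p[k] * u[k] ** (-lam) - f[k]
                dphi = -2.0 / h**2 - lam * p[k] * u[k] ** (-lam - 1)
                new = u[k] - phi / dphi
                u[k] = new if new > 0.0 else 0.5 * u[k]   # safeguard (assumption): stay positive

def restrict_fw(r):
    """Full-weighting restriction of a residual to the next coarser grid."""
    nc = (len(r) - 1) // 2
    return [0.0] + [0.25 * (r[2*K-1] + 2.0*r[2*K] + r[2*K+1]) for K in range(1, nc)] + [0.0]

def interpolate(ec):
    """Linear interpolation of a coarse-grid correction to the fine grid."""
    e = [0.0] * (2 * (len(ec) - 1) + 1)
    for K, v in enumerate(ec):
        e[2*K] = v
    for K in range(len(ec) - 1):
        e[2*K+1] = 0.5 * (ec[K] + ec[K+1])
    return e

def fasv(u, f, p_func, lam, nu1=2, nu2=1):
    """One FAS V-cycle on a grid with len(u) - 1 intervals (a power of two)."""
    n = len(u) - 1
    h = 1.0 / n
    p = [p_func(k * h) for k in range(n + 1)]
    if n == 2:                                    # coarsest grid: a single unknown
        relax(u, f, p, lam, h, 50)
        return u
    relax(u, f, p, lam, h, nu1)                   # step 1
    Fu = apply_F(u, p, lam, h)
    r = [fk - Fk for fk, Fk in zip(f, Fu)]
    uc0 = u[::2]                                  # I_h^{2h} u^h by injection (assumption)
    Fc = apply_F(uc0, p[::2], lam, 2.0 * h)
    fc = [a + b for a, b in zip(restrict_fw(r), Fc)]   # step 2: FAS right-hand side
    uc = fasv(list(uc0), fc, p_func, lam, nu1, nu2)
    corr = interpolate([a - b for a, b in zip(uc, uc0)])
    for k in range(1, n):                         # step 3, with a positivity safeguard
        u[k] = max(u[k] + corr[k], 0.5 * u[k])
    relax(u, f, p, lam, h, nu2)                   # step 4
    return u

def solve(n=64, lam=2.0, p_func=lambda x: 1.0, cycles=15):
    h = 1.0 / n
    u = [0.1 * math.sin(math.pi * k * h) for k in range(n + 1)]
    u[0] = u[n] = 0.0
    f = [0.0] * (n + 1)
    p = [p_func(k * h) for k in range(n + 1)]
    history = [max(abs(v) for v in apply_F(u, p, lam, h))]
    for _ in range(cycles):
        fasv(u, f, p_func, lam)
        history.append(max(abs(v) for v in apply_F(u, p, lam, h)))
    return u, history

u, history = solve()
for c in range(1, len(history)):
    print(c, history[c], history[c] / history[c - 1])
```

The printed ratios give a per-cycle residual reduction factor, comparable in spirit to (though not the same setting as) the 2D factors reported in Table 1.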

An important consideration for this (or any) iterative method is the choice of a good initial guess. Clearly a better initial guess will reduce the overall effort required to obtain an acceptable solution. A standard approach in multigrid is to obtain a good initial guess by first solving the problem on a coarse grid and then interpolating that solution to the fine grid for use as an initial guess. Solving this coarse-grid problem, in turn, will be easier if an initial guess is obtained by first solving the problem on a still coarser grid. Applying this idea recursively leads to the full multigrid (FMG) scheme, which (applied to the nonlinear FASV scheme) may be described as follows:

$u^h \leftarrow \mathrm{FASFMG}^h(u^h, f^h, \nu_1, \nu_2)$
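1. If $\Omega^h$ is the coarsest grid, go to step 3. Else:

$$f^{2h} = I_h^{2h}\left(f^h - F^h(u^h)\right) + F^{2h}\left(I_h^{2h} u^h\right),$$
$$u^{2h} \leftarrow I_h^{2h} u^h,$$
$$u^{2h} \leftarrow \mathrm{FASFMG}^{2h}\left(u^{2h}, f^{2h}, \nu_1, \nu_2\right).$$

2. Correct $u^h \leftarrow u^h + I_{2h}^h\left(u^{2h} - I_h^{2h} u^h\right)$.

3. $u^h \leftarrow \mathrm{FASV}^h(u^h, f^h, \nu_1, \nu_2)$.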


Numerical results for multigrid methods 


We have implemented FASV using Newton-Jacobi and red-black Newton-Gauss-Seidel iteration schemes. (Our implementation was in MATLAB using vector arithmetic. We elected not to analyse lexicographic Newton-Gauss-Seidel since it is not vectorizable. We did encode it, however, and found that the slowness of the for loops overwhelmed the speed of convergence.) Several different choices for $\lambda$, $p(x)$, and $p(x, y)$ were used, as were several sets of relaxation parameters.

Table 1 gives some quantitative information regarding the performance of the method, comparing average V-cycle convergence factors for various choices of parameters. The results shown were obtained using red-black Newton-Gauss-Seidel relaxation. We find that for this problem we are able to obtain convergence rates similar to those obtained on the linear elliptic model problems for which multigrid is best known ([14], [16], [17]). Data for the one-dimensional problem are not shown; they are very similar to the two-dimensional case.


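Dimension   p(x, y)               λ   Fine-grid size   Average V-cycle convergence factor
2           2xy                   2   63 x 63          0.051
2           2xy                   5   63 x 63          0.050
2           2xy                   8   63 x 63          0.078
2           2 sin(2πx) sin(πy)    2   63 x 63          0.060
2           2 sin(2πx) sin(πy)    5   63 x 63          0.063
2           2 sin(2πx) sin(πy)    8   63 x 63          0.104
2           x/y                   2   63 x 63          0.059
2           x/y                   2   63 x 63          0.060
2           x/y                   8   63 x 63          0.086

Table 1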
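The text does not say how the average factor in Table 1 was computed. A common convention, assumed here, is the geometric mean of the per-cycle residual reductions over m cycles, $(\|r_m\| / \|r_0\|)^{1/m}$. The residual norms in the usage line are made up for illustration.

```python
def average_convergence_factor(residual_norms):
    """Geometric-mean reduction per cycle: (r_m / r_0) ** (1 / m)."""
    m = len(residual_norms) - 1
    return (residual_norms[-1] / residual_norms[0]) ** (1.0 / m)

print(average_convergence_factor([1.0, 0.05, 0.0025, 0.000125]))   # about 0.05
```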

Additionally, we have implemented FASFMG using Newton-Jacobi and red-black Newton-Gauss-Seidel iteration schemes. Again, we find that the performance of the method is comparable to that found for FMG applied to the linear model problems ([15], [17]).

CONCLUSIONS 


Our survey of existence and uniqueness results has shown that the problem

$$\Delta u + p(x)\,u^{-\lambda} = 0, \quad x \in \Omega,$$

is guaranteed to have unique solutions under certain conditions, although these solutions are generally not known in closed form. The problem arises in certain non-Newtonian fluid problems, so there is some interest in actually computing solutions. We have shown that for homogeneous Dirichlet boundary conditions on the unit interval and the unit square, multigrid methods appear to provide an efficient means of solution for reasonable choices of $p(x)$.

We note, however, that an actual convergence proof for the FAS method would be very difficult to obtain, since such proofs normally require that we be able to decompose the space of grid functions into two (operator-dependent) subspaces: error components in one are annihilated by relaxation, while those in the other are annihilated by coarse-grid correction. While such an analysis is available for linear problems, nonlinear problems generally can be treated only by linearization near a solution. In point of fact, the literature on the foundational theory of the FAS method is remarkably sparse.

A new technique, called multilevel projection methods (PML), has recently been introduced [18] in an effort to provide a unifying, thematic approach to the design of a multilevel solver for a given problem. The main feature of PML methods is that the only basic choices that must be made concern the subspaces that will be used in relaxation and coarsening. All other components of the method, such as interlevel transfers, scaling, and coarse-level problems, are determined by projection between appropriate subspaces. In [18], several prototypical problems are developed to illustrate the principles involved. It now appears that the best hope of obtaining a strong foundational theory for multilevel treatment of nonlinear problems may well be through careful and judicious application of PML, and our future research into solution methods for the problems we have discussed here will be aimed in that direction.
ABSTRACT


There are two main requirements for practical simulation of unsteady flow at high 
Reynolds number: the algorithm must accurately propagate discontinuous flow fields 
without excessive artificial viscosity, and it must have some adaptive capability to 
concentrate computational effort where it is most needed. We satisfy the first of 
these requirements with a second-order Godunov method similar to those used for 
high-speed flows with shocks, and the second with a grid-based refinement scheme 
which avoids some of the drawbacks associated with unstructured meshes. 

These two features of our algorithm place certain constraints on the projection method used to enforce incompressibility. Velocities are cell-based, leading to a Laplacian stencil for the projection which decouples adjacent grid points. We discuss features of the multigrid and multilevel iteration schemes required for solution of the resulting decoupled problem. Variable-density flows require use of a modified projection operator; we have found a multigrid method for this modified projection that successfully handles density jumps of thousands to one. Numerical results are shown for the 2D adaptive and 3D variable-density algorithms.

INTRODUCTION 

The incompressible flow algorithm presented by Bell, Colella and Glaz [3] combines 
the original projection method of Chorin [9, 10] with the Godunov methodology 
developed by Colella [11] to yield a robust scheme which is second-order in both 
space and time. In [5] Bell and Marcus extend this method to handle flows involving 
spatial density variations. 

Originally developed for gas dynamics problems with strong shocks, the second- 
order Godunov technology gives the algorithm the ability to propagate discontinuous 

*This work was performed under the auspices of the U.S. Department of Energy by the Lawrence 
Livermore National Laboratory under contract No. W-7405-Eng-48. Support was provided by the 
Applied Mathematical Sciences Program of the Office of Energy Research under contract No. W- 
7405-Eng-48, and by the Defense Nuclear Agency under IACRO 93-817. 
flow fields or density jumps without introducing nonphysical oscillations, violating
conservation laws, or employing unnecessary dissipation. The resulting schemes are 
therefore appropriate for studying unsteady flows with little or no viscosity. The 
projection portion of the algorithm enforces incompressibility without the need for 
an artificial pressure boundary condition. 

The most natural discretization for Godunov methods involves storing all velocity 
components at the centers of grid cells. Node-based variants are not difficult to 
obtain, but the requirement that all components be stored at the same points is a 
fairly strong one. Formulations of the projection using the staggered grid system of 
Harlow and Welch [13] are thus largely incompatible with the Godunov approach. Use
of collocated velocities, however, leads to unusual difference stencils for the projection 
which decouple adjacent grid cells. 

We have developed extensions to the algorithms of [3] and [5], the most important 
of which are a reformulation of the methods on an adaptive hierarchy of grids, and 
the use of multigrid and multilevel iteration techniques to speed up computation of 
the projection. While we have made some attempt to keep separate the questions of 
how to formulate the projection versus how to solve it, there has inevitably been some 
interplay between these two halves of the problem. The decoupled difference stencils 
used by the projection in uniform parts of the grid place certain requirements on the 
multigrid scheme, while the need for efficient convergence of the multilevel iteration 
influences the choice of derivative stencils across coarse-fine grid interfaces. 

These issues, concerning the formulation of the projection and its solution via 
multigrid methods, are the primary concern of this paper. Most of this material 
is new, though the need for a decoupled multigrid stencil was discussed briefly in
[4]. The detailed formulation of the Godunov module, methods for error estimation 
and regridding, and the addition of viscous terms to the equations are all discussed 
in another paper, currently in preparation. These subjects will therefore be given 
only the most cursory attention in the present work. We will, however, describe 
the time-stepping procedure, so as to place the projection in its proper context as a 
component of the algorithm. This will be part of the general overview given in the 
next section. The section after that discusses the multigrid projection, while the final 
section presents some examples and numerical results. 

OVERVIEW OF THE METHOD 


The equations we are attempting to solve are the incompressible Euler equations with finite-amplitude density variation,

$$U_t + (U \cdot \nabla) U = -\frac{\nabla p}{\rho}, \tag{1}$$

$$\rho_t + (U \cdot \nabla)\rho = 0, \tag{2}$$

$$\nabla \cdot U = 0, \tag{3}$$


where $U$ represents the velocity field, $p$ represents the hydrodynamic pressure, and $\rho$ represents the local mass density. We will denote the x and y components of velocity by $u$ and $v$, respectively.

The range of density variation in a problem may be moderate, as in the case of two 
or more different gases mixing in a combustion chamber, or may be relatively large, as 
in the 800-to-1 density jump at a water-air interface. Of course, many flows of interest
do not involve density variations at all — for these problems (2) may be discarded, or 
similar equations may be used to advect passive quantities which do not affect the flow 
field. (Our implementation of the adaptive scheme currently handles only constant- 
density flows.) Flows with very small density variations are an intermediate case, as 
they may not require the full variable-density formulation. As described in [5], these 
flows may be modeled using what amounts to a constant-density projection method 
with a Boussinesq forcing term added to (1). 

From a computational point of view the most problematic term in (1-3) is the 
pressure gradient. In contrast to the compressible case, pressure in incompressible flow 
plays no thermodynamic role, and cannot be determined from an equation of state. Its 
only function in the equations is to indirectly enforce the incompressibility constraint 
(3). The essential idea of projection methods is to eliminate the pressure entirely, 
by use of an operator which projects the velocity U onto the space of divergence-free 
vector fields. 

The theory behind the projection operator is based on the Hodge decomposition, 
which provides that any vector field V can be decomposed into a divergence-free 
component $V_d$ and the gradient of some scalar $\phi$. This decomposition can be made
unique through imposition of appropriate boundary conditions, e.g., no flow through 
boundaries. It is also orthogonal, since divergence and gradient are skew-adjoint with 
respect to the usual inner products on scalar and vector fields. 

Given operators D for divergence and G for gradient, either continuous or discrete, 
a projection onto the space of divergence-free fields can be written as 

$$P = I - G(DG)^{-1} D. \tag{4}$$

(The numerical inversion of DG takes the place of solving the “pressure Poisson 
equation” that often appears in incompressible flow algorithms.) A modification 
of this projection is required for variable-density flows. We want to decompose a field into a divergence-free component and $1/\rho$ times the gradient of a scalar. The appropriate form is

$$P_\sigma = I - \sigma G (D \sigma G)^{-1} D, \tag{5}$$

where $\sigma = 1/\rho$ and orthogonality is now with respect to a $\rho$-weighted inner product. In terms of this weighted projection, (1) can be written as

$$U_t = P_\sigma\left(-(U \cdot \nabla) U\right). \tag{6}$$

To obtain a second-order temporal discretization of this equation (and (2)), we use a fractional step process. First, the Godunov advection procedure is used to compute $(U \cdot \nabla)U$ and $(U \cdot \nabla)\rho$ at the $n + 1/2$ time level. The density equation can then be advanced immediately, while the projection is applied to $(U \cdot \nabla)U^{n+1/2}$ to give a divergence-free approximation to $U_t$:

$$\frac{\rho^{n+1} - \rho^n}{\Delta t} = -(U \cdot \nabla)\rho^{n+1/2}, \tag{7}$$

$$\frac{U^{n+1} - U^n}{\Delta t} = P_\sigma\left[-(U \cdot \nabla) U^{n+1/2}\right]. \tag{8}$$

Since the $\rho$ equation can be advanced first, $\rho^{n+1/2}$ is available for use in the projection. The Godunov method uses $(1/\rho)\nabla p^{n-1/2}$ to approximate the effect of the incompressibility constraint on $U_t$; the projection in (8) then yields an updated approximation to $(1/\rho)\nabla p^{n+1/2}$ to be used at the next time step.

We will not go into detail on the internal workings of the Godunov procedure here. Suffice it to say that, using approximations to time derivatives and limited slopes ($U_x$, etc.) at cell centers at time $n$, $U$ and $\rho$ are extrapolated to cell edges (faces in 3D) at time $n + 1/2$. Upwinding rules resolve the choice between values coming from either side of an edge; these edge values are then differenced to yield the $(U \cdot \nabla)$ terms at cell centers at time $n + 1/2$. The detailed procedure we use is very similar to that described in [3], with the variable-density enhancements given in [5], and an improved treatment of the transverse derivative terms ($vU_y$, etc.) as described in [4].

For a more thorough discussion of the Hodge decomposition, the incompressible 
Godunov algorithm, and the time-stepping procedure, we refer the reader to [3] and 
[5]. These papers deal exclusively with the single-grid case, but the adaptive case 
requires no changes to the time-stepping method and only minimal modification to 
the Godunov method, e.g., interpolation into ghost cells around the edges of fine 
grids. An adaptive Godunov method for gas dynamics that is similar to our approach 
is described in [7]. We describe the adaptive projection at the end of the next section; other aspects of our adaptive incompressible algorithm will be addressed in a
forthcoming paper. 


MULTIGRID PROJECTION 

We now discuss a multigrid algorithm for computing the variable-density projection (5). For simplicity we restrict the notation to two dimensions, but the methods
presented are immediately extensible to 3D. A three-dimensional flow example is 
included in the following section. 

Given appropriate divergence and gradient stencils, a projection of the form (5) will yield a velocity field that is discretely divergence-free to the limit imposed by roundoff error. The projection will therefore be idempotent, i.e., repeated application will not further modify the projected vector field. This is a valuable property for an unsteady flow algorithm, since the projection will be applied at every time step. If $D = -G^T$, then the projection will also be orthogonal, yielding the nearest (in a $\rho$-weighted sense) divergence-free field.
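Idempotence and $\rho$-weighted orthogonality are purely algebraic consequences of the form (5). The sketch below checks them in exact rational arithmetic. It uses a hypothetical 2 x 4 "divergence" matrix D (not the paper's stencils), G = -D^T, and a made-up positive diagonal σ. The helper names are our own.

```python
from fractions import Fraction as Fr

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def identity(n):
    return [[Fr(int(i == j)) for j in range(n)] for i in range(n)]

def subtract(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def inverse(A):
    """Gauss-Jordan inversion in exact rational arithmetic (A assumed nonsingular)."""
    n = len(A)
    M = [list(row) + identity(n)[i] for i, row in enumerate(A)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        pivot = M[col][col]
        M[col] = [v / pivot for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]

# Hypothetical toy operators: 4 "velocity" unknowns, 2 "cell" unknowns.
D = [[Fr(1), Fr(-1), Fr(0), Fr(2)],
     [Fr(0), Fr(1), Fr(1), Fr(-1)]]
G = [[-v for v in row] for row in transpose(D)]                        # D = -G^T
sigma = [[Fr(int(i == j), i + 1) for j in range(4)] for i in range(4)]  # sigma = 1/rho
rho = inverse(sigma)

sG = matmul(sigma, G)
P = subtract(identity(4), matmul(sG, matmul(inverse(matmul(D, sG)), D)))   # form (5)
W = matmul(matmul(transpose(P), rho), subtract(identity(4), P))

print(matmul(P, P) == P)                                  # idempotent
print(all(v == 0 for row in matmul(D, P) for v in row))   # D P = 0: divergence-free
print(all(v == 0 for row in W for v in row))              # rho-weighted orthogonality
```

With floating-point stencils the same identities hold only to roundoff, which is the limit the text mentions.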
Figure 1: Decoupled grid structure: $D\sigma G\phi$ at cells marked 'V' depends on $\phi$ at 'V' cells, $\sigma$ at the cells marked for x-differences, and $\sigma$ at '|' cells for y-differences. Residuals from 'V' cells are restricted by averaging to the cells marked with boxes on the next coarser grid. For purposes of restriction and interpolation, these coarse and fine values behave as if they were located at the points indicated by the arrows, rather than at the centers of their respective cells.


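The simplest choice is to use centered differences for both divergence and gradient:

$$(DV)_{ij} = \frac{u_{i+1,j} - u_{i-1,j}}{2\Delta x} + \frac{v_{i,j+1} - v_{i,j-1}}{2\Delta y}, \tag{9}$$

$$(G\phi)_{ij} = \left( \frac{\phi_{i+1,j} - \phi_{i-1,j}}{2\Delta x},\ \frac{\phi_{i,j+1} - \phi_{i,j-1}}{2\Delta y} \right). \tag{10}$$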

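Composition of these then yields the elliptic stencil

$$(D\sigma G\phi)_{ij} = \frac{\sigma_{i+1,j}(\phi_{i+2,j} - \phi_{ij}) - \sigma_{i-1,j}(\phi_{ij} - \phi_{i-2,j})}{4(\Delta x)^2} + \frac{\sigma_{i,j+1}(\phi_{i,j+2} - \phi_{ij}) - \sigma_{i,j-1}(\phi_{ij} - \phi_{i,j-2})}{4(\Delta y)^2}, \tag{11}$$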


which appears in the projection. The main calculation we have to perform is the inversion of this expression: we have to solve $D\sigma G\phi = DV$ for $\phi$ given an input vector field $V$. Boundary conditions for $\phi$ are determined by those for the velocity field. Slip walls (inviscid flow) yield Neumann boundary conditions for $\phi$, while in periodic problems all quantities are, naturally, periodic. Though the linear system is singular, solvability is provided by the special structure of the problem: if $D = -G^T$, then the range of $G$ is orthogonal to the null space of $D$; therefore, any field in the range of $D$ is also in the range of $D\sigma G$.

Ignoring the $\sigma$'s for the moment, we see that (11) looks like a stretched version of the familiar 5-point stencil for the Laplacian. The difference is that (11) provides no communication between adjacent grid points. Except for the effect of boundary conditions, four distinct sets of grid points participate in four distinct linear systems. Grids couple in pairs at wall boundaries, but the only local coupling comes from the smoothness of the right-hand side $DV$. Figure 1 illustrates the decoupling pattern, including the role of the $\sigma$'s.
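The decoupling can be checked directly. The minimal sketch below builds (11) as the composition $D(\sigma G\phi)$ of the centered operators (9) and (10). Periodic boundaries and the σ values are assumptions for illustration. A φ that is constant on each of the four parity classes (i mod 2, j mod 2) is annihilated. Perturbing a single φ value changes $D\sigma G\phi$ only at cells of the same parity class.

```python
def grad(phi, dx, dy):
    """Centered gradient (10) on a periodic grid."""
    nx, ny = len(phi), len(phi[0])
    gx = [[(phi[(i+1) % nx][j] - phi[(i-1) % nx][j]) / (2*dx) for j in range(ny)] for i in range(nx)]
    gy = [[(phi[i][(j+1) % ny] - phi[i][(j-1) % ny]) / (2*dy) for j in range(ny)] for i in range(nx)]
    return gx, gy

def div(u, v, dx, dy):
    """Centered divergence (9) on a periodic grid."""
    nx, ny = len(u), len(u[0])
    return [[(u[(i+1) % nx][j] - u[(i-1) % nx][j]) / (2*dx)
             + (v[i][(j+1) % ny] - v[i][(j-1) % ny]) / (2*dy) for j in range(ny)]
            for i in range(nx)]

def d_sigma_g(phi, sigma, dx=1.0, dy=1.0):
    """Stencil (11) computed as D(sigma * G phi)."""
    gx, gy = grad(phi, dx, dy)
    nx, ny = len(phi), len(phi[0])
    sx = [[sigma[i][j] * gx[i][j] for j in range(ny)] for i in range(nx)]
    sy = [[sigma[i][j] * gy[i][j] for j in range(ny)] for i in range(nx)]
    return div(sx, sy, dx, dy)

nx = ny = 8
sigma = [[1.0 + 0.1 * ((3 * i + 5 * j) % 7) for j in range(ny)] for i in range(nx)]
c = [[1.0, -2.0], [0.5, 3.0]]
phi = [[c[i % 2][j % 2] for j in range(ny)] for i in range(nx)]
print(max(abs(v) for row in d_sigma_g(phi, sigma) for v in row))   # 0.0
```

A multigrid coarsening that averaged across these four classes would mix unknowns from independent systems, consistent with the divergence reported in the text.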



However smooth the initial right hand side, later residuals in a multigrid scheme 
tend to have significant components at all wavenumbers. Multigrid depends on the 
fact that a solution to a coarsened system provides a good approximation to the 
desired fine solution. It is not surprising, therefore, that every experiment we have 
tried where the coarsening procedure combined components from decoupled grids 
proved to be wildly divergent. On the other hand, coarsening schemes which respect 
the decoupling lead to systems analogous to those arising from the usual 5-point
Laplacian, for which multigrid is quite effective. 

Let us define transformations between coarse and fine index spaces as follows,

$$ I = 2\lfloor i/4 \rfloor + i \bmod 2, \qquad (12) $$

$$ i = 4\lfloor I/2 \rfloor + I \bmod 2, \qquad (13) $$

and similarly for J, j. Capitals denote indices on the coarse grid, lower case on the fine
grid, and ⌊·⌋ reduces its argument to the next lower (or equal) integer. Since 2⌊i/4⌋
is even, I mod 2 = i mod 2, so each decoupled component of the fine grid coarsens onto
the component of the same parity on the coarse grid. Each coarse point (I, J) then has
four fine points associated with it: (i, j), (i, j + 2), (i + 2, j), (i + 2, j + 2). These fine
points do not appear to be quite centered around the coarse point, which would
complicate restriction and interpolation formulas. We observe, however, that a centered
pattern results if the points in question are each shifted to the center of their local
2 × 2 blocks, as illustrated in Figure 1. This shifting does not change the spatial
relationship of any coupled points, even at the boundary, so for multigrid purposes we
can treat each coarse point as if it were centered among its four associated fine points.
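The index maps (12) and (13) and simple-averaging restriction fit in a few lines. This sketch assumes 0-based indices and a square fine grid of size 2nc with nc even. The function names are hypothetical.

```python
def coarse_index(i):
    """Fine index -> coarse index, eq. (12): I = 2*floor(i/4) + i mod 2."""
    return 2 * (i // 4) + i % 2


def fine_index(I):
    """Coarse index -> first associated fine index, eq. (13)."""
    return 4 * (I // 2) + I % 2


def associated_fine_points(I, J):
    """The four fine cells (i, j), (i, j+2), (i+2, j), (i+2, j+2) tied to (I, J)."""
    i, j = fine_index(I), fine_index(J)
    return [(i, j), (i, j + 2), (i + 2, j), (i + 2, j + 2)]


def restrict_average(fine, nc):
    """Restrict a 2nc-by-2nc fine array to nc-by-nc by simple averaging (nc even)."""
    coarse = [[0.0] * nc for _ in range(nc)]
    for I in range(nc):
        for J in range(nc):
            pts = associated_fine_points(I, J)
            coarse[I][J] = sum(fine[i][j] for i, j in pts) / 4.0
    return coarse


print([coarse_index(i) for i in range(8)])  # [0, 1, 0, 1, 2, 3, 2, 3]
print(associated_fine_points(1, 2))         # [(1, 4), (1, 6), (3, 4), (3, 6)]
```

In the first printed list, even fine indices map to even coarse indices and odd to odd. Restriction therefore never mixes the decoupled components.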

The simplest restriction formula gives a coarse cell the average of the values from
its associated fine cells, while the simplest interpolation formula distributes the coarse
value to each of the four fine cells (piecewise-constant interpolation). There are both
theoretical results and experiments, discussed in [17], which suggest that for second-order
problems at least one of these must be replaced by a higher-order formula in
order to give satisfactory convergence rates. Our own experience does not bear out
this assertion. However, for difficult problems involving large density jumps we have
observed an improvement in robustness from use of a bilinear stencil for interpolation,

$$ \phi_{i,j} = \frac{1}{16}\left( 9\phi_{I,J} + 3\phi_{I-2,J} + 3\phi_{I,J-2} + \phi_{I-2,J-2} \right), \qquad (14) $$

and similarly for φ_{i,j+2}, etc. A smaller improvement resulted from the opposite choice,
bilinear restriction with piecewise-constant interpolation. Problems without difficult
density configurations did not show a consistent improvement in convergence rate
with either stencil. We use (14) routinely in our variable-density code, but use the
piecewise-constant formula in the constant-density adaptive code. Restriction is by
simple averaging in both cases.
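One possible sketch of the bilinear formula (14) applies the weights 9/16, 3/16, 3/16 and 1/16 to coarse neighbours two cells away, which keeps them in the same decoupled component. The neighbours are always taken toward the side on which the fine cell sits in the shifted picture of Figure 1. Periodic wrap-around and the dictionary keyed by fine offsets (0 or 2 in each direction) are assumptions of this sketch.

```python
def interpolate_bilinear(coarse, I, J):
    """Bilinear interpolation (14) to the four fine cells associated with (I, J).

    Coarse neighbours come from the same decoupled component (offsets +-2)
    and wrap periodically. The result is keyed by the offset (di, dj) in
    {0, 2} x {0, 2} of each fine cell from the first fine index of (13).
    """
    n, m = len(coarse), len(coarse[0])

    def c(a, b):
        return coarse[a % n][b % m]

    out = {}
    for di in (0, 2):
        for dj in (0, 2):
            sI = -2 if di == 0 else 2  # nearer coarse neighbour in x
            sJ = -2 if dj == 0 else 2  # nearer coarse neighbour in y
            out[(di, dj)] = (9 * c(I, J) + 3 * c(I + sI, J)
                             + 3 * c(I, J + sJ) + c(I + sI, J + sJ)) / 16.0
    return out


coarse = [[float(a) for b in range(8)] for a in range(8)]
print(interpolate_bilinear(coarse, 4, 4))
# {(0, 0): 3.5, (0, 2): 3.5, (2, 0): 4.5, (2, 2): 4.5}
```

With coarse data linear in I, the fine values fall 0.5 below and above the coarse value. That offset is a quarter of the spacing between coarse points of one component, which matches the shifted positions.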

We have now satisfactorily dealt with the decoupling problem for φ, but what
about σ, i.e., how do we form the elliptic stencil on coarser grids? It is apparent
from Figure 1 that σ values do not occupy the same decoupled component of the grid
as φ and the residuals. Moreover, σ values used for x-differences are on a different
component from those used for y-differences.

One possibility is to redefine the problem to place σ's at the same points as φ's:


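$$ (D\sigma G\phi)_{i,j} = \frac{1}{2(2\Delta x)^2}\Big[(\sigma_{i-2,j} + \sigma_{i,j})(\phi_{i-2,j} - \phi_{i,j}) + (\sigma_{i+2,j} + \sigma_{i,j})(\phi_{i+2,j} - \phi_{i,j})\Big] + \frac{1}{2(2\Delta y)^2}\Big[(\sigma_{i,j-2} + \sigma_{i,j})(\phi_{i,j-2} - \phi_{i,j}) + (\sigma_{i,j+2} + \sigma_{i,j})(\phi_{i,j+2} - \phi_{i,j})\Big] \qquad (15) $$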


The hope is that σ could be coarsened by averaging over associated cells, just as φ is.
Unfortunately, this scheme gives somewhat degraded accuracy and, more importantly,
horrible multigrid convergence rates for problems with large density variations.

The convergence rate of the multigrid cycle seems more strongly dependent on
the proper coarsening pattern for σ than on any other single feature of the method.
The following procedure is in fact the only scheme we have tried that gave anything
approaching satisfactory results. We keep two different arrays of σ values on coarser
grids, one for x-differences and one for y-differences. These are coarsened as follows:


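$$ \sigma^x_{I,J} = \tfrac{1}{2}\left(\sigma^x_{i',j} + \sigma^x_{i',j+2}\right), \qquad \sigma^y_{I,J} = \tfrac{1}{2}\left(\sigma^y_{i,j'} + \sigma^y_{i+2,j'}\right), \qquad (16) $$

where i and j are given by (13), i' = 2I + I mod 2, j' = 2J + J mod 2, and σ^x = σ^y = σ
on the fine grid.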
Coarse stencils based on (11) and formed with these values perform well even in the 
presence of sharp density interfaces. They only begin to fail when presented with 
such nonphysical effects as large sawtooth variations in the density field. 
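The two-array coarsening can be sketched under one assumption about the rule. The coarse σ^x at (I, J) is taken as the average of the fine σ^x at column i' = 2I + I mod 2 over the fine rows j and j + 2 from (13), and σ^y follows the transposed rule. For I = 1, for instance, i' = 3. That column lies between fine cell 2 (part of coarse cell 0) and fine cell 4 (part of coarse cell 2), the two coarse neighbours that the coarse stencil couples. The names and the 2nc-by-2nc layout are illustrative.

```python
def first_fine(I):
    """Eq. (13): first fine index associated with coarse index I."""
    return 4 * (I // 2) + I % 2


def edge_fine(I):
    """Fine index i' = 2I + I mod 2 of the sigma value on a coarse 'edge'."""
    return 2 * I + I % 2


def coarsen_sigma(sx, sy, nc):
    """Coarsen x- and y-difference sigma arrays (fine size 2nc, nc even).

    Assumed form: sigma^x_{I,J} averages fine sigma^x at column i' over the
    rows j, j+2 of (13); sigma^y uses the transposed rule.
    """
    cx = [[0.0] * nc for _ in range(nc)]
    cy = [[0.0] * nc for _ in range(nc)]
    for I in range(nc):
        for J in range(nc):
            i, j = first_fine(I), first_fine(J)
            ip, jp = edge_fine(I), edge_fine(J)
            cx[I][J] = 0.5 * (sx[ip][j] + sx[ip][j + 2])
            cy[I][J] = 0.5 * (sy[i][jp] + sy[i + 2][jp])
    return cx, cy


# Fine sigma varying only in x: the coarse x-array samples it at i' = 0, 3, 4, 7.
sx = [[float(i) for j in range(8)] for i in range(8)]
sy = [[float(10 * j) for j in range(8)] for i in range(8)]
cx, cy = coarsen_sigma(sx, sy, 4)
print([cx[I][0] for I in range(4)])  # [0.0, 3.0, 4.0, 7.0]
```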

One common approach to deriving coarse grid equations is to use the form RAP, 
where R is the restriction operator, A is the elliptic stencil, and P is the interpolation 
operator. Unfortunately, this approach does not give a usable stencil when applied 
with piecewise-constant formulas for R and P, and higher-order transfer stencils give 
rise to larger, more complicated coarse grid operators. Use of (16) can be motivated 
in two ways, however. First, patterns like this one do appear in the RAP stencils, 
even though those formulas have other drawbacks. Second, if we confine our attention 
to one decoupled component of the grid, the σ locations can be interpreted as the
edges between its cells. An analogy to a diffusion problem with φ as heat content and
σ as conductivity then suggests an averaging along edges equivalent to (16).

A detailed discussion of multigrid for problems with difficult coefficients can be 
found in [1] . Our approach seems adequate for configurations likely to arise in practi- 
cal projection problems, however, and the authors of [1] acknowledge certain patho- 
logical cases where even their more complicated schemes will fail. 

For our multigrid schedule we use the pattern called FMV in [8] — the F-cycle in 
[17] — with smoothing by point Gauss-Seidel. Two smoothing steps before each grid 
transfer operation, up or down, seems to give the best performance. In problems with 
large density variations the Gauss-Seidel method alone does not give rapid conver- 
gence on the coarsest grid, so we have replaced it at that level with an exact solver. 
A direct method could be used here, but we have found it more convenient to employ 
Figure 2: Examples of decoupled derivative stencils across a coarse-fine interface. The
crosses indicate a fine cell (left) and a coarse cell (right) at which y-derivatives are
evaluated. Bullets show which cells participate in the stencils. In each case, values
on the opposite side of the interface are interpolated in the transverse direction to
the circled points, giving three values on a line normal to the interface from which
the derivative can be computed.

a simple diagonally-preconditioned conjugate gradient method based on algorithm
10.3-1 and equation 10.3-3 from [12]. The conjugate gradient approach has the
advantage that it requires neither explicit storage of a matrix nor special treatment
of the singular linear system.
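The coarsest-grid solver can be illustrated with a generic diagonally preconditioned conjugate gradient loop. This is a textbook Jacobi-PCG sketch, not a transcription of algorithm 10.3-1 in [12]. The names `pcg_jacobi` and `apply_a` are hypothetical.

The loop needs only a matrix-vector product. For a singular symmetric positive semidefinite system, it still converges as long as the right-hand side lies in the range of the matrix. Since the stencil (11) is negative semidefinite, one would pass −DσG and −DV. The test problem, a periodic 1D variable-coefficient Laplacian, is an assumption chosen only because its null space (the constants) mimics the singular projection system.

```python
import math


def pcg_jacobi(apply_a, diag, b, x0, tol=1e-10, max_iter=500):
    """Conjugate gradients with a diagonal (Jacobi) preconditioner.

    apply_a(x) returns A x for a list x, so A is never stored. A must be
    symmetric positive semidefinite with a positive diagonal; if A is
    singular, b must lie in its range.
    """
    x = list(x0)
    r = [bk - ak for bk, ak in zip(b, apply_a(x))]
    z = [rk / dk for rk, dk in zip(r, diag)]
    p = list(z)
    rz = sum(rk * zk for rk, zk in zip(r, z))
    for _ in range(max_iter):
        if max(abs(rk) for rk in r) < tol:
            break
        ap = apply_a(p)
        alpha = rz / sum(pk * apk for pk, apk in zip(p, ap))
        x = [xk + alpha * pk for xk, pk in zip(x, p)]
        r = [rk - alpha * apk for rk, apk in zip(r, ap)]
        z = [rk / dk for rk, dk in zip(r, diag)]
        rz_new = sum(rk * zk for rk, zk in zip(r, z))
        beta = rz_new / rz
        p = [zk + beta * pk for zk, pk in zip(z, p)]
        rz = rz_new
    return x


# Singular example: periodic 1D variable-coefficient Laplacian.
n = 16
s = [1.0 + 0.5 * math.cos(2 * math.pi * k / n) for k in range(n)]  # edge (k, k+1)


def apply_a(x):
    return [s[i] * (x[i] - x[(i + 1) % n]) + s[i - 1] * (x[i] - x[i - 1])
            for i in range(n)]


diag = [s[i] + s[i - 1] for i in range(n)]
b = [math.sin(2 * math.pi * i / n) for i in range(n)]  # zero sum: b is in the range
x = pcg_jacobi(apply_a, diag, b, [0.0] * n)
print(max(abs(bk - ak) for bk, ak in zip(b, apply_a(x))))
```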

This completes our description of the variable-density multigrid projection. One 
variation should be noted in passing. To reformulate the 2D projection in cylindrical 
(r-z) coordinates, it suffices to redefine σ as x/ρ, where x = r becomes the radial
coordinate. No other change is required in the projection portion of the algorithm. 


An adaptive version of the projection method can be described, at least roughly, 
in terms of a few relatively minor additions to the single-grid algorithm. The details 
of the implementation, however, are considerably more complicated, and we only 
have a working program for the 2D constant-density flow case. Our purpose here is 
not to give a step-by-step breakdown of the entire adaptive procedure, but rather 
to highlight the ways in which a decoupled Laplacian stencil affects the multilevel 
projection calculation. For the sake of brevity, we have decided not to burden this 
discussion with explicit formulas — we trust that all necessary expressions can be easily 
derived from the descriptions given in the text. 

The structure of the grid hierarchy is similar to that used in [7]. A single rectangu- 
lar grid covers the entire computational domain at the coarsest level. In “interesting” 
regions of the flow, finer grid patches are laid down, refined from the coarse level by a 
fixed ratio r. These finer grids are themselves rectangular, both to minimize program 
overhead and to improve performance on vector architectures. If necessary, more lev- 
els of grids can be created, but we impose a “proper-nesting” requirement that each 
refined level l have a border of cells at level l − 1 separating it from still coarser levels.
The simplest choice for a refinement ratio is 2, but we often use 4 instead in order to 
reduce both the number of refined levels and the amount of wasted storage allocated 
to coarse grids underlying fine grids. 

In contrast to approaches like that of [15], we have maintained a logical separation 
between the multilevel iteration for the adaptive scheme, and the multigrid solvers 
on individual grids. Our multilevel iteration proceeds as follows, where we assume 
familiarity with the residual-correction formulation discussed in [8] and [17]: 

- Start with an initial approximation to φ, either 0 or the value obtained at the
  previous time step.
- Repeat until residuals satisfy tolerance:
  - Compute residual on all grids, including coarse-fine interfaces.
  - Restrict residuals from fine to coarse grids.
  - Set correction array to 0 at coarse level.
  - For each level l, from coarse to fine, do:
    - Execute FMV cycle for residual equation on each grid of level l, using
      values from adjacent grids as boundary conditions if necessary.
    - Add correction into φ at level l.
    - Interpolate correction to next finer level, if any.

The convergence properties of this method depend on a coarse grid solution being 
a satisfactory approximation to the solution on the composite grid. In order for this 
to be the case, all interpolation, restriction, and difference stencils have to respect 
the decoupling pattern. For the grid transfer operations, these formulas are like those 
we have already discussed. Restriction is by simple averaging of associated cells. For 
interpolation we have had best results with a higher-order method, a biquadratic 
formula using coarse cells from the appropriate decoupled grid component. Unlike 
the single-grid case, effective position shifts like those shown in Figure 1 are no longer 
valid, so we use the actual positions of cell centers to derive the interpolation stencil. 

Difference formulas across the grid interfaces are more problematic. Whereas 
restriction and interpolation schemes affect only the convergence rate of the iteration, 
the difference stencils determine the actual converged solution. Stencil outlines for 
both fine and coarse points near the interface are shown in Figure 2. In both cases we 
use quadratic interpolation to obtain third-order accurate values on the opposite side 
of the interface, then a three-point difference formula to give a second-order accurate 
derivative at the desired point. Composition of second-order derivatives in D and 
G gives a Laplacian approximation that is first-order accurate along the interface, 
sufficient for global second-order accuracy of the projected velocity field. 
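The interface derivatives rely on two generic operations. The first evaluates the quadratic through three points, as in the transverse interpolation to the circled points of Figure 2. The second differentiates that quadratic, as in the three-point formula on the line normal to the interface. The sketch below implements both for arbitrary, unequal spacings. The spacings h and 2h in the check are an arbitrary choice, not the geometry of Figure 2.

```python
import math


def quad_derivative(xs, fs, x):
    """Derivative at x of the quadratic through the three points (xs[k], fs[k])."""
    (x0, x1, x2), (f0, f1, f2) = xs, fs
    d0 = (2 * x - x1 - x2) / ((x0 - x1) * (x0 - x2))
    d1 = (2 * x - x0 - x2) / ((x1 - x0) * (x1 - x2))
    d2 = (2 * x - x0 - x1) / ((x2 - x0) * (x2 - x1))
    return f0 * d0 + f1 * d1 + f2 * d2


def quad_value(xs, fs, x):
    """Value at x of the same quadratic (transverse interpolation)."""
    (x0, x1, x2), (f0, f1, f2) = xs, fs
    return (f0 * (x - x1) * (x - x2) / ((x0 - x1) * (x0 - x2))
            + f1 * (x - x0) * (x - x2) / ((x1 - x0) * (x1 - x2))
            + f2 * (x - x0) * (x - x1) / ((x2 - x0) * (x2 - x1)))


def derivative_error(h, x=0.3):
    """Error of the three-point derivative of sin at x, spacings h and 2h."""
    xs = (x - h, x, x + 2 * h)
    return abs(quad_derivative(xs, [math.sin(t) for t in xs], x) - math.cos(x))


print(derivative_error(0.01) / derivative_error(0.005))  # close to 4
```

Halving h cuts the error by a factor close to 4. That is the expected second-order behaviour of the derivative formula.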

These derivative stencils are used for computing residuals and for obtaining di- 
vergence and gradient in the projection formula. Note that D is no longer equal to 
Case              32x32          64x64                 128x128                256x256
(1)               3.54926    8   0.907518  (3.91)  7   0.228655  (3.97)  7    0.0573795 (3.98)  7
(2)               4.16279    8   1.14104   (3.65)  7   0.293781  (3.88)  7    0.074023  (3.97)  7
(3)               9.84036   13   4.7626    (2.07)  9   2.14014   (2.23) 11    0.876724  (2.44) 18
(4)               1.29866   19   0.378418  (3.43) 16   0.097241  (3.89) 15    0.024058  (4.04) 14
(5) ∞-norm        7.26845   19   1.84558   (3.94) 20   0.476259  (3.88) 21    0.123554  (3.85) 22
(5) 2-norm        0.802074       0.196431  (4.08)      0.0487401 (4.03)       0.0121474 (4.01)


Table 1: Convergence results for both variable-density and adaptive implementations
of the decoupled projection. For each case the problem was run with square base
grids of four different sizes (32x32 through 256x256) to a final residual less than
10 _nr. The numbers given for each run are the final ∞-norm error in the velocity
field (times 1000), the factor of improvement from the next coarser grid, and the
number of multigrid cycles required. For the last run (adaptive code), 2-norm error
data is also given. A description of each problem is given in the text.


−G^T. This means that the adaptive projection is no longer quite orthogonal, and we
have to add a slight correction to DV to make the system solvable. The alternative, 
however, would be to use less accurate stencils for either D or G at the interface, 
which would seriously degrade the performance of the algorithm. 

NUMERICAL EXAMPLES 


Table 1 summarizes the convergence behavior of the projection for five different 
problems. The domain is the unit square with no flow through the boundaries. In 
each case we start with the divergence-free vector field 

$$ u = 0.2\,(x + 1)\big(\pi(y + 1)\cos \pi y + \sin \pi y\big)\sin \pi x, \qquad v = -0.2\,(y + 1)\big(\pi(x + 1)\cos \pi x + \sin \pi x\big)\sin \pi y, \qquad (17) $$


add to it 1/ρ times the gradient of

$$ \phi = \cos(\pi x)\cos(\pi y), \qquad (18) $$


then apply the projection. This should strip off the gradient portion of each field and
return the divergence-free portion (17). The five cases considered are: (1) constant
density; (2) mild density variation, ρ = 1 + 100 sin²(πx) sin²(πy); (3) extreme density
variation, ρ = 1 + 100000 sin²(πx) sin²(πy); (4) a discontinuous jump in density, with
ρ = 1 inside a circle of radius 0.1 centered at (0.4, 0.4) and ρ = 10001 elsewhere;
(5) constant density, adaptive: the square from 0.25 to 0.75 in x and y is refined by a
factor of four from the base grid.
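One can check (17) by hand: the field comes from the stream function ψ = 0.2(x + 1)(y + 1) sin πx sin πy through u = ∂ψ/∂y and v = −∂ψ/∂x. It is therefore divergence-free and has zero normal velocity on the walls of the unit square. The sketch below confirms this numerically with centered differences. The step size and sample points are arbitrary choices.

```python
import math


def velocity(x, y):
    """Test field (17)."""
    pi = math.pi
    u = 0.2 * (x + 1) * (pi * (y + 1) * math.cos(pi * y) + math.sin(pi * y)) * math.sin(pi * x)
    v = -0.2 * (y + 1) * (pi * (x + 1) * math.cos(pi * x) + math.sin(pi * x)) * math.sin(pi * y)
    return u, v


def divergence(x, y, h=1e-5):
    """Centered-difference estimate of du/dx + dv/dy at (x, y)."""
    dudx = (velocity(x + h, y)[0] - velocity(x - h, y)[0]) / (2 * h)
    dvdy = (velocity(x, y + h)[1] - velocity(x, y - h)[1]) / (2 * h)
    return dudx + dvdy


samples = [(0.1 * a + 0.05, 0.1 * b + 0.05) for a in range(10) for b in range(10)]
print(max(abs(divergence(x, y)) for x, y in samples))  # only difference error remains
```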


Cases (1) and (2) are smooth, so the multigrid scheme converges rapidly and gives
unambiguous second-order convergence. Cases (3) and (4) are more difficult, but the
scheme is still clearly better than first order. In the adaptive case (5) the errors are
concentrated along the coarse-fine grid interface, where the discretization of DG is
only first-order accurate. Convergence is still second-order in the 2-norm, but may
be slightly degraded in the ∞-norm. Note that this example is not representative
of the intended use of the adaptive method. In normal operation the interfaces are
well separated from complicated regions of the flow field, which dominate the error
behavior of the scheme. The slower convergence of the iteration for the adaptive
scheme (19 to 22 cycles in Table 1) appears to be due to the mismatch between coarse
grid stencils and the residuals computed at the interface. Relaxation at interfaces
and/or closer integration of the multigrid and multilevel iterations might yield a
faster algorithm.
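The parenthesized factors in Table 1 are ratios of errors on successive grids. Taking log₂ of a ratio gives the observed order of accuracy, so a factor of 4 means second order. The sketch recomputes the factors for cases (1) and (3) from the tabulated ∞-norm errors.

```python
import math

# Infinity-norm velocity errors (x 1000) from Table 1, base grids 32, 64, 128, 256
table = {
    "case 1": [3.54926, 0.907518, 0.228655, 0.0573795],
    "case 3": [9.84036, 4.7626, 2.14014, 0.876724],
}

ratios, orders = {}, {}
for case, errs in table.items():
    pairs = list(zip(errs, errs[1:]))
    ratios[case] = [round(e0 / e1, 2) for e0, e1 in pairs]   # the parenthesized factors
    orders[case] = [math.log2(e0 / e1) for e0, e1 in pairs]  # observed order of accuracy

print(ratios)  # {'case 1': [3.91, 3.97, 3.98], 'case 3': [2.07, 2.23, 2.44]}
```

Case (1) gives orders just under 2. Case (3) gives orders between 1 and 1.3, so it is better than first order but short of second.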

Quantitative analysis of the flow solver as a whole is beyond the scope of this
paper. Our remaining two examples are intended mainly as illustrations, to demonstrate
the power of the algorithm for modeling unsteady flow fields with finely detailed
structure. In Figure 3 we show an image from a 3D variable-density calculation set
up and run by Dan Marcus. A bubble of helium was initially started at rest near
the bottom of the domain. The ambient fluid is air, giving a density ratio of 7.25.
The calculation was performed on a 64x64x128 grid occupying one quarter of the
volume shown; this was filled out to 128³ for rendering by reflection through the two
symmetry planes. At the time of the picture the bubble has risen and developed into
a torus, with more complicated flow patterns visible in the outer mixed region. We
do not claim that this calculation accurately models a turbulent flow field. However,
a more detailed examination of transition to turbulence, using a projection method
similar to the one presented here, can be found in [6].

Figure 4 illustrates the adaptive algorithm. A 64x64 base grid is refined twice, 
by a factor of four each time, so the finest level has resolution equivalent to a single 
1024x1024 grid. Every 10 time steps, grids are re-allocated according to a procedure
based on second derivatives of the velocity field. In the initial conditions, four patches 
of vorticity with radii 0.025 are placed in the unit square at (0.5, 0.5), (0.5, 0.575), 
and the two 120° rotations of this position. Each patch has uniform vorticity except 
for a linear ramp 3/256 wide down to zero vorticity at the edge — the radius of the 
patch is the distance from the center to the halfway point of the ramp. The initial 
velocity field is obtained by solving for the stream function associated with the given 
vorticity field. This is identical to the projection calculation, except that the stream 
function satisfies a Dirichlet boundary condition. Note how well the Godunov advec- 
tion scheme preserves fine details of the flow field, even in the highly stretched regions 
near the vortex core. 


CONCLUSIONS AND FUTURE PLANS 


Centered difference stencils are the simplest choice for implementing the discrete 
divergence and gradient, subject to the requirement that velocity components must 
all be defined at the same points. The decoupled projection stencils arising from this 
choice require various contortions in the solution algorithm, which raises doubts as to 
Figure 3: Volume rendering of a helium bubble rising through air. The central part
of the bubble has taken on a simple toroidal shape, but the outlying mixed regions 
show more complicated flow patterns. 
Figure 4: Adaptive simulation of a four-way vortex merger problem, showing contours
of vorticity. 


the practical utility of the results thus obtained. Despite the unusual behavior of the 
projection, however, the difficulties have been overcome and the method successfully 
models a variety of incompressible flow problems. 

It seems likely that some flow problems will not be suitable for this type of algo- 
rithm. Though the projection does not directly cause high-wavenumber instabilities, 
neither does it do anything to suppress them when they are excited by other parts of
a flow solver. Lai, for example, reports having difficulty using this type of projection 




for certain combustion problems [14]. We have seen stability problems ourselves in an 
adaptive version of the algorithm of [4], where a staggered-mesh projection is applied 
to the edge velocities computed in the Godunov predictor. 

While we believe the decoupled method is a worthy contender, these difficulties 
beg for comparative studies with other types of projections. One alternative is the 
regularization given by Strikwerda [16] . Though coupled, however, the stencils derived 
in this work are both large and asymmetrical. A newer approach is that of Almgren, 
Bell and Szymczak in [2], which is coupled and symmetrical but not quite idempotent. 
We have recently completed an adaptive version of this projection, early results from 
which seem quite promising. 
SUMMARY

We present ideas on how to use wavelets in the solution of boundary value ordinary differential 
equations. Rather than using classical wavelets, we adapt their construction so that they become 
(bi)orthogonal with respect to the inner product defined by the operator. The stiffness matrix in a 
Galerkin method then becomes diagonal and can thus be trivially inverted. We show how one can 
construct an O(N) algorithm for various constant and variable coefficient operators. 

INTRODUCTION 

The purpose of this paper is to use wavelets in the solution of certain linear ordinary differential
equations of the form

$$ Lu(x) = f(x) \quad \text{for } x \in [0,1], \qquad \text{where } L = \sum_{j=0}^{m} a_j(x)\, D^j, $$

and with appropriate boundary conditions on u(x) for x = 0, 1.

Currently there exist two major solution techniques. First, if the coefficients a_j(x) of the
operator are constants, then the Fourier transform is well suited for solving these equations. The 
underlying reason is that the complex exponentials are eigenfunctions of a constant coefficient 
operator and they form an orthogonal system. As a result the operator becomes diagonal in the 

’The first author is partially supported by DARPA Grant AFOSR 89-0455 and ONR Grant N00014-90-J-1343, the 
second author is Research Assistant of the National Fund of Scientific Research Belgium and partially supported by 
ONR Grant N00014-90-J-1343. 
Fourier basis and can thus trivially be inverted. The numerical algorithm then boils down to
calculating the discrete Fourier transform of the right hand side, dividing each coefficient by its
corresponding entry in a diagonal matrix and finally taking the inverse Fourier transform to obtain
the solution. This can be done quickly using the fast Fourier transform, which has a complexity of
O(N log N), where N is the number of unknowns in the discretization.
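The constant-coefficient algorithm just described can be sketched on an assumed model problem: −u'' + c u = f on [0, 1) with periodic boundary conditions and c > 0. Periodicity makes each complex exponential e^{2πikx} an eigenfunction of the operator, with eigenvalue (2πk)² + c. For self-containment the sketch uses a plain O(N²) DFT; reaching O(N log N) would require an FFT in its place.

```python
import cmath
import math


def dft(a, sign=-1):
    """Plain O(N^2) discrete Fourier transform (sign=+1: unnormalised inverse)."""
    n = len(a)
    return [sum(a[m] * cmath.exp(sign * 2j * math.pi * k * m / n) for m in range(n))
            for k in range(n)]


def solve_periodic(f, c):
    """Solve -u'' + c u = f on [0, 1), periodic, c > 0, from N samples of f."""
    n = len(f)
    fh = dft(f)
    uh = []
    for k in range(n):
        kk = k if k <= n // 2 else k - n                  # signed wavenumber
        uh.append(fh[k] / ((2 * math.pi * kk) ** 2 + c))  # divide by the diagonal entry
    return [z.real / n for z in dft(uh, sign=+1)]


n, c = 16, 1.0
xs = [m / n for m in range(n)]
u_exact = [math.sin(2 * math.pi * x) for x in xs]
f = [(4 * math.pi ** 2 + c) * u for u in u_exact]
u = solve_periodic(f, c)
print(max(abs(a - b) for a, b in zip(u, u_exact)))  # round-off level
```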

If the coefficients are not constant one typically uses finite element or finite difference methods to
discretize the problem. We focus here on finite element methods. Define the operator inner product
associated with an operator L by

$$ ((u, v)) = \langle Lu, v \rangle. $$

A weak solution u can be found with a Petrov-Galerkin method, i.e. consider two spaces S and S*
and look for a solution u ∈ S such that

$$ ((u, v)) = \langle f, v \rangle $$

for all v in S*. If S and S* are finite dimensional spaces with the same dimension, this leads to a
linear system of equations. The matrix of this system, also referred to as the stiffness matrix, has as
elements the operator inner products of the basis functions of S and S*.

Traditionally one uses very local finite elements such that the stiffness matrix has a banded 
structure. The linear system can then be solved efficiently with an iterative method. These classical 
finite elements however have the disadvantage that the stiffness matrix becomes ill conditioned as 
the problem size grows. This slows down the convergence speed of the iterative algorithm 
dramatically. It is well understood by now that this can be solved with multiresolution techniques 
such as multigrid or hierarchical basis functions [1, 2]. Multiresolution finite element bases can 
provide preconditioners which result in a uniformly bounded condition number, see e.g. [3, 4, 5]. 
The convergence rate of the iteration is then independent of the problem size.
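The ill-conditioning mentioned above can be quantified for the simplest case, assumed here for illustration. Take piecewise-linear elements for −u'' = f on a uniform mesh with Dirichlet conditions. The stiffness matrix is then proportional to tridiag(−1, 2, −1), and its eigenvalues are known in closed form. The condition number is cot²(π/(2(n + 1))), which grows like the square of the number of unknowns n.

```python
import math


def stiffness_condition_number(n):
    """Spectral condition number of the n-by-n matrix tridiag(-1, 2, -1).

    Up to a factor 1/h this is the stiffness matrix of -u'' = f for
    piecewise-linear elements on a uniform mesh with Dirichlet conditions.
    Its eigenvalues are 4 sin^2(k pi / (2(n + 1))), k = 1, ..., n.
    """
    lam = [4 * math.sin(k * math.pi / (2 * (n + 1))) ** 2 for k in range(1, n + 1)]
    return max(lam) / min(lam)


for n in (15, 31, 63, 127):
    print(n, round(stiffness_condition_number(n), 1))
```

Each doubling of the mesh multiplies the condition number by roughly 4. This growth is what slows an unpreconditioned iterative solver.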

The research presented here is motivated by the question of how good wavelets are for the solution
of ordinary differential equations. We know that there are basically four main properties of
wavelets; namely, they provide a multiresolution basis for a wide variety of function spaces, they are
local in both space and frequency, they satisfy (bi)orthogonality conditions, and fast transform
algorithms are available. Because of these properties, wavelets have already proven to be a valuable
substitute for the Fourier transform in many applications.

One possible idea, as proposed by several researchers, is to use wavelets as basis functions in a
Galerkin method. This has proven to work and results in a linear system that is sparse because of
the compact support of the wavelets, and that, after preconditioning, has a condition number
independent of problem size because of the multiresolution structure. However, in this setting the
wavelets do not provide significantly better results than more general multiresolution techniques
(see above), and in fact one of their major properties, namely their (bi)orthogonality, is not
exploited at all.

Three questions are addressed in this research. First, how can one make use of the
(bi)orthogonality property of the wavelets? Second, which operators can be diagonalized by
wavelets? Third, are fast algorithms available, and what is their complexity?
PRELIMINARIES


Notation and definitions 


Much of the notation will be presented as we go along. Here we just note that the inner product
of two square integrable functions f, g ∈ L²(ℝ) is defined by

$$ \langle f, g \rangle = \int_{-\infty}^{+\infty} f(x)\, g(x)\, dx, $$

and that the Fourier transform of a function f is defined as

$$ \hat f(\omega) = \int_{-\infty}^{+\infty} f(x)\, e^{-i\omega x}\, dx. $$

We say that a function w is an L-spline if

$$ L^{*}Lw = 0 \quad \text{and} \quad w \in C^{2m-2}, $$

where L* is the adjoint of L, a linear differential operator of order m. This definition leads to the
classical piecewise polynomial splines in case L = D^m.

Multiresolution analysis 


We give a brief review of wavelets and multiresolution analysis. For more information one can
consult [6, 7, 8, 9]. A multiresolution analysis of L²(ℝ) is defined as a set of closed subspaces V_j,
with j ∈ ℤ, that exhibit the following properties:

1. V_j ⊂ V_{j+1},

2. v(x) ∈ V_j ⇔ v(2x) ∈ V_{j+1} and v(x) ∈ V_0 ⇔ v(x + 1) ∈ V_0,

3. ⋃_{j=−∞}^{+∞} V_j is dense in L²(ℝ) and ⋂_{j=−∞}^{+∞} V_j = {0},

4. A scaling function φ(x) ∈ V_0 exists such that the set of functions {φ_{j,l}(x) | l ∈ ℤ}, with
φ_{j,l}(x) = √(2^j) φ(2^j x − l), is a Riesz basis of V_j.

As a result there is a sequence {h_k | k ∈ ℤ} such that the scaling function satisfies a refinement
equation

$$ \phi(x) = 2 \sum_k h_k\, \phi(2x - k). \qquad (1) $$
Define W_j now as a complementary space of V_j in V_{j+1}, such that V_{j+1} = V_j ⊕ W_j (⊕ stands for
direct sum) and, consequently,

$$ \bigoplus_j W_j = L^2(\mathbb{R}). $$

Note that this definition of W_j as a complementary space is not unique.

A function ψ(x) is a wavelet if the set of functions {ψ(x − l) | l ∈ ℤ} is a Riesz basis of W_0. The
set of wavelet functions {ψ_{j,l}(x) | l, j ∈ ℤ} is then a Riesz basis of L²(ℝ). Since the wavelet is an
element of V_1, it too satisfies a refinement relation,

$$ \psi(x) = 2 \sum_k g_k\, \phi(2x - k). \qquad (2) $$


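There exist dual functions φ̃_{j,l}(x) = √(2^j) φ̃(2^j x − l) and ψ̃_{j,l}(x) = √(2^j) ψ̃(2^j x − l) such that the
projection operators P_j and Q_j onto V_j and W_j, respectively, are given by

$$ P_j f(x) = \sum_l \langle f, \tilde\phi_{j,l} \rangle\, \phi_{j,l}(x) \quad \text{and} \quad Q_j f(x) = \sum_l \langle f, \tilde\psi_{j,l} \rangle\, \psi_{j,l}(x). $$

The basis functions and dual functions are biorthogonal,

$$ \langle \phi_{j,l}, \tilde\phi_{j,l'} \rangle = \delta_{l-l'} \quad \text{and} \quad \langle \psi_{j,l}, \tilde\psi_{j',l'} \rangle = \delta_{j-j'}\, \delta_{l-l'}. \qquad (3) $$

If the basis functions are orthogonal, they coincide with the dual functions and the projections are
orthogonal.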

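The dual scaling function and wavelet satisfy

$$ \tilde\phi(x) = 2 \sum_k \tilde h_k\, \tilde\phi(2x - k), \qquad \tilde\psi(x) = 2 \sum_k \tilde g_k\, \tilde\phi(2x - k), \qquad (4) $$

and

$$ \tilde\phi(2x - k) = \sum_l h_{k-2l}\, \tilde\phi(x - l) + \sum_l g_{k-2l}\, \tilde\psi(x - l). \qquad (5) $$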

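Taking the Fourier transform of the refinement equations (1) and (2) yields

$$ \hat\phi(\omega) = h(\omega/2)\, \hat\phi(\omega/2) \quad \text{with} \quad h(\omega) = \sum_k h_k\, e^{-ik\omega} $$

and

$$ \hat\psi(\omega) = g(\omega/2)\, \hat\phi(\omega/2) \quad \text{with} \quad g(\omega) = \sum_k g_k\, e^{-ik\omega}. $$

Here h(ω) and g(ω) are 2π-periodic functions that correspond to discrete filters. Similar definitions
and equations hold for the dual functions. A necessary condition for biorthogonality is then

$$ \forall \omega \in \mathbb{R}: \quad m(\omega)\, \overline{\tilde m(\omega)}^{\,T} = \mathbf{1}, $$

where 𝟏 is the 2 × 2 identity matrix and

$$ m(\omega) = \begin{pmatrix} h(\omega) & h(\omega + \pi) \\ g(\omega) & g(\omega + \pi) \end{pmatrix}, $$

and similarly for m̃(ω). The existence of the dual filters is guaranteed by the following lemma: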
Lemma 1 The space generated by the set of functions {ψ_{j,l} | l ∈ ℤ} complements V_j in V_{j+1} if and
only if δ(ω) = det m(ω) does not vanish.

The following statements are now equivalent:

• The dual wavelet has M vanishing moments.

• Any polynomial with degree less than M can be written as a linear combination of the
functions φ_{j,l}(x) with l ∈ ℤ.

• If f ∈ C^M, then the error of the approximation ‖f − P_j f‖ decays as O(h^M) with h = 2^{−j}.

These statements are also equivalent to the Strang-Fix condition [10].

The fast wavelet transform 


Since V_j is equal to V_{j−1} ⊕ W_{j−1}, a function v_j ∈ V_j can be written uniquely as the sum of a
function v_{j−1} ∈ V_{j−1} and a function w_{j−1} ∈ W_{j−1}:

$$ v_j(x) = \sum_k v_{j,k}\, \phi_{j,k}(x) = v_{j-1}(x) + w_{j-1}(x) = \sum_l v_{j-1,l}\, \phi_{j-1,l}(x) + \sum_l \gamma_{j-1,l}\, \psi_{j-1,l}(x). $$

There is a one-to-one relationship between the coefficients in the different representations. The
decomposition formulae can be found using (4):

$$ v_{j-1,l} = \sqrt{2} \sum_k \tilde h_{k-2l}\, v_{j,k} \quad \text{and} \quad \gamma_{j-1,l} = \sqrt{2} \sum_k \tilde g_{k-2l}\, v_{j,k}. $$

The reconstruction step involves calculating the v_{j,k} from the v_{j−1,l} and the γ_{j−1,l}. Using (5) we have

$$ v_{j,k} = \sqrt{2} \sum_l h_{k-2l}\, v_{j-1,l} + \sqrt{2} \sum_l g_{k-2l}\, \gamma_{j-1,l}. $$

When applied recursively, these formulae define a transformation, the fast wavelet transform [8, 11]. 
The decomposition step consists of applying a low-pass (h) and a band-pass (g) filter followed by 
downsampling (i.e. retaining only the even index samples). The reconstruction consists of 
upsampling (i.e. adding a zero between every two samples) followed by filtering and addition. Note 
that the filter coefficients of the fast wavelet transform are given by the coefficients of the 
refinement equations. 
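One step of the fast wavelet transform and its inverse can be sketched with the Haar filters h_0 = h_1 = 1/2, g_0 = 1/2, g_1 = −1/2, in the normalisation of (1) and (2). Haar is an assumed example. It is orthogonal, so the dual filters coincide with the primal ones and the same dictionaries serve for both directions. For a biorthogonal pair one would pass h̃, g̃ to `decompose` and h, g to `reconstruct`. Periodic wrap-around is also an assumption.

```python
import math

SQRT2 = math.sqrt(2.0)
# Haar filters in the normalisation of (1)-(2): phi(x) = 2 sum_k h_k phi(2x - k).
# Haar is orthogonal, so the dual filters coincide with these.
H = {0: 0.5, 1: 0.5}
G = {0: 0.5, 1: -0.5}


def decompose(v, h=H, g=G):
    """v_j -> (v_{j-1}, gamma_{j-1}): filter, keep even shifts (periodic).

    For biorthogonal wavelets, pass the dual filters here.
    """
    n = len(v)
    low = [SQRT2 * sum(hk * v[(2 * l + k) % n] for k, hk in h.items()) for l in range(n // 2)]
    high = [SQRT2 * sum(gk * v[(2 * l + k) % n] for k, gk in g.items()) for l in range(n // 2)]
    return low, high


def reconstruct(low, high, h=H, g=G):
    """Inverse step: upsample, filter with h and g, and add (periodic)."""
    n = 2 * len(low)
    v = [0.0] * n
    for l in range(len(low)):
        for k, hk in h.items():
            v[(2 * l + k) % n] += SQRT2 * hk * low[l]
        for k, gk in g.items():
            v[(2 * l + k) % n] += SQRT2 * gk * high[l]
    return v


def fwt(v):
    """Full transform: apply decompose recursively to the low-pass part."""
    details = []
    while len(v) > 1:
        v, d = decompose(v)
        details.append(d)
    return v, details


v = [4.0, 2.0, 5.0, 5.0, 1.0, 3.0, 0.0, 0.0]
low, high = decompose(v)
print(reconstruct(low, high))  # recovers v up to round-off
```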

There are many constructions of wavelets. Here we shall only consider compactly supported 
wavelets as in [12, 13]. In this case the filters used in the fast wavelet transform are finite impulse 
response filters and a fast accurate implementation is assured. 
General idea


We shall assume that L is self-adjoint and positive definite and, in particular, we can write

$$ L = \mathcal{V}^{*}\mathcal{V}, $$

where 𝒱* is the adjoint of 𝒱. We call 𝒱 the square root operator of L. Suppose that {Ψ_{j,l}} and
{Ψ*_{j,l}}, for an appropriate range of indices, are bases for S and S* respectively. The entries of the
stiffness matrix are then given by

$$ ((\Psi_{j,l}, \Psi^{*}_{j',l'})) = \langle L\Psi_{j,l}, \Psi^{*}_{j',l'} \rangle = \langle \mathcal{V}\Psi_{j,l}, \mathcal{V}\Psi^{*}_{j',l'} \rangle. $$

Now, the idea is to let

$$ \Psi_{j,l} = \mathcal{V}^{-1}\psi_{j,l} \quad \text{and} \quad \Psi^{*}_{j,l} = \mathcal{V}^{-1}\tilde\psi_{j,l}, $$

where ψ and ψ̃ are the wavelets of a classical multiresolution analysis. Because of the
biorthogonality (3), the stiffness matrix becomes a diagonal matrix which can trivially be inverted.
This avoids the use of an iterative algorithm. We will call the Ψ and Ψ* functions the operator
wavelets and the ψ functions the original wavelets. The operator wavelets are biorthogonal with
respect to the operator inner product, a property we refer to as operator biorthogonality.

This idea can be powerful, but there are a few problems. First of all, one has to check whether
the operator wavelets still provide a multiresolution analysis where the successive approximations
to a general function converge sufficiently fast (cf. the Strang-Fix condition). Secondly, one has to
construct a fast wavelet transform for this operator multiresolution analysis. We want operator
wavelets to be compactly supported and to be able to construct compactly supported operator
scaling functions Φ_{j,l}. We will see that the latter is not as simple as just applying 𝒱^{−1} to the
original scaling functions.

The analysis is relatively straightforward for simple constant coefficient operators such as the 
Laplace and polyharmonic operator. For more general constant coefficient operators, we will show 
that one needs to modify the construction of the original wavelets for the operator wavelets to 
satisfy all the desired properties. We will discuss the Helmholtz operator as a typical example. At
the end of the paper we shall consider a variable coefficient operator. 

A similar idea was described in [14, 15]. However, there only the operator wavelets of different
levels are operator orthogonal and not the ones from the same level. As a result, one does not 
obtain a full diagonalization, but rather a decoupling of equations corresponding to different levels. 

Our idea is different from the technique presented in [16]. There wavelets are used to efficiently 
compute the inverse of the matrix that comes from a finite difference discretization. It is also shown 
that the wavelets provide a diagonal preconditioner which yields uniformly bounded condition 
numbers. 

In [17, 18] antiderivatives of wavelets are used in a Galerkin method. This parallels our
construction in the case of the Laplace or polyharmonic operator.
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LAPLACE OPERATOR 


The one dimensional Laplace operator and its square root are

$$L = -D^2 \quad\text{and}\quad V = D.$$

The associated operator inner product is therefore $((u,v)) = (u', v')$. Since the action of $V^{-1}$ is simply taking the antiderivative here, we define the operator wavelets as

$$\Psi(x) = \int_{-\infty}^{x} \psi(t)\,dt \quad\text{and}\quad \tilde\Psi(x) = \int_{-\infty}^{x} \tilde\psi(t)\,dt.$$

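The operator wavelets are compactly supported because the integral of the original wavelets has to vanish. Also, translation and dilation invariance is preserved, so we define

$$\Psi_{j,l}(x) = \Psi(2^j x - l) \quad\text{and}\quad \tilde\Psi_{j,l}(x) = \tilde\Psi(2^j x - l).$$

Since $\Psi_{j,l}'(x) = 2^j\psi(2^j x - l)$ and the dilated original wavelets satisfy $(\psi(2^j\cdot - l), \tilde\psi(2^{j'}\cdot - l')) = 2^{-j}\delta_{j,j'}\delta_{l,l'}$, it is then easy to see that

$$((\Psi_{j,l}, \tilde\Psi_{j',l'})) = 2^j\,\delta_{j,j'}\,\delta_{l,l'} \quad\text{for } j, j', l, l' \in \mathbb{Z}.$$

This means that the stiffness matrix is diagonal with powers of 2 on its diagonal.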
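As a quick numerical check of this diagonal structure, take the Haar wavelet (the case worked out in the Example section below). Its operator wavelet derivative $\Psi_{j,l}' = 2^j\psi(2^j x - l)$ is constant on dyadic cells, so the inner products $((\Psi_{j,l}, \Psi_{j',l'})) = (\Psi_{j,l}', \Psi_{j',l'}')$ can be computed exactly from cell values. This is a minimal sketch. The helper names are hypothetical, and the wavelets are restricted to $[0,1)$ with levels $0 \le j < n$.

```python
def operator_wavelet_derivative(j, l, n):
    """Cell values of Psi_{j,l}'(x) = 2**j * psi(2**j x - l) on the 2**n cells of [0, 1).

    For the Haar wavelet this derivative is constant on dyadic cells, so the cell
    values represent it exactly (requires 0 <= j < n and 0 <= l < 2**j).
    """
    N = 2 ** n
    width = N >> j                  # fine cells per level-j interval
    start, half = l * width, width // 2
    vals = [0.0] * N
    for c in range(start, start + half):
        vals[c] = 2.0 ** j          # psi = +1 on the left half
    for c in range(start + half, start + width):
        vals[c] = -(2.0 ** j)       # psi = -1 on the right half
    return vals


def operator_gram(levels, n):
    """Matrix of ((Psi_{j,l}, Psi_{j',l'})) = (Psi_{j,l}', Psi_{j',l'}') for 0 <= j < levels <= n."""
    h = 1.0 / 2 ** n
    basis = [(j, l) for j in range(levels) for l in range(2 ** j)]
    derivs = [operator_wavelet_derivative(j, l, n) for j, l in basis]
    gram = [[h * sum(a * b for a, b in zip(d1, d2)) for d2 in derivs] for d1 in derivs]
    return basis, gram


basis, gram = operator_gram(levels=4, n=6)
diagonal = [gram[i][i] for i in range(len(basis))]  # equals 2**j for each (j, l)
```

All off-diagonal entries vanish, because a finer wavelet sits on a half-interval where the coarser derivative is constant and has a vanishing integral there. Wavelets on the same level have disjoint supports.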

We now need to find an operator scaling function $\Phi$. The antiderivative of the original scaling function is not compactly supported and hence not suited. We instead construct the operator scaling function $\Phi$ by taking the convolution of the original scaling function with the indicator function on $[0,1]$,

$$\Phi = \phi * \chi_{[0,1]},$$

and similarly for the dual functions. We will show that these functions indeed generate a multiresolution analysis. To this end define

$$V_j = \operatorname{clos}\operatorname{span}\{\Phi_{j,l} \mid l \in \mathbb{Z}\} \quad\text{and}\quad W_j = \operatorname{clos}\operatorname{span}\{\Psi_{j,l} \mid l \in \mathbb{Z}\}.$$

We show that the $V_j$ spaces are nested and that $W_j$ complements $V_j$ in $V_{j+1}$.

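In the Fourier domain we have

$$\hat\Phi(\omega) = \frac{1 - e^{-i\omega}}{i\omega}\,\hat\phi(\omega) \quad\text{and}\quad \hat\Psi(\omega) = \frac{1}{i\omega}\,\hat\psi(\omega).$$

A simple calculation shows that the operator scaling function satisfies a refinement equation

$$\hat\Phi(\omega) = \hat\Phi(\omega/2)\,H(\omega/2) \quad\text{with}\quad H(\omega) = \frac{1 + e^{-i\omega}}{2}\,h(\omega).$$

Consequently, the $V_j$ spaces are nested. If we can find a function $G$ such that

$$\hat\Psi(\omega) = \hat\Phi(\omega/2)\,G(\omega/2),$$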
then this implies that $W_j$ is a subset of $V_{j+1}$. It is easy to see that this holds with

$$G(\omega) = \frac{1}{2(1 - e^{-i\omega})}\,g(\omega).$$

This function is well defined because $g(0) = 0$.
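The space $W_j$ complements $V_j$ in $V_{j+1}$ if

$$\Delta(\omega) = \det\begin{pmatrix} H(\omega) & H(\omega+\pi) \\ G(\omega) & G(\omega+\pi) \end{pmatrix}$$

does not vanish, see lemma 1. In fact, we readily see that $\Delta(\omega) = \delta(\omega)/4$, where $\delta(\omega)$ is the corresponding determinant formed with $h$ and $g$, and this cannot vanish since $\phi$ and $\psi$ generate a multiresolution analysis. The construction of the dual functions $\tilde\Phi$ and $\tilde\Psi$ from $\tilde\phi$ and $\tilde\psi$ is completely similar. The coefficients of the trigonometric functions $H$, $\tilde H$, $G$ and $\tilde G$ now define a fast wavelet transform.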
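For readers who want to check the claim $\Delta(\omega) = \delta(\omega)/4$, the computation can be written out. It is our own expansion of the statement above. Write $\delta(\omega) = h(\omega)g(\omega+\pi) - h(\omega+\pi)g(\omega)$ and use $e^{-i(\omega+\pi)} = -e^{-i\omega}$ in the formulas for $H$ and $G$:

$$\Delta(\omega) = H(\omega)G(\omega+\pi) - H(\omega+\pi)G(\omega) = \frac{1+e^{-i\omega}}{2}\,h(\omega)\,\frac{g(\omega+\pi)}{2(1+e^{-i\omega})} - \frac{1-e^{-i\omega}}{2}\,h(\omega+\pi)\,\frac{g(\omega)}{2(1-e^{-i\omega})} = \frac{\delta(\omega)}{4}.$$

The factors $1 \pm e^{-i\omega}$ introduced by the box-function convolution and by the antiderivative cancel pairwise. At the isolated zeros of these factors, the identity follows by continuity.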

Note that there is no reason why the operator scaling functions should be operator biorthogonal 
and in fact one can prove that this never happens. Note also that if true, this property would make 
the use of wavelets superfluous. 


Algorithm 


We will describe the algorithm in the case of periodic boundary conditions. This implies that the basis functions on the interval $[0,1]$ are just the periodization of the basis functions on the real line. Let $S = V_n$ and consider the basis $\{\Phi_{n,l} \mid 0 \le l < 2^n\}$. Define vectors $b$ and $x$ such that

$$b_l = (f, \tilde\Phi_{n,l}) \quad\text{and}\quad u = \sum_{l=0}^{2^n-1} x_l\,\Phi_{n,l}.$$

The Galerkin method with this basis then yields a system

$$Ax = b \quad\text{with}\quad A_{k,l} = ((\Phi_{n,l}, \tilde\Phi_{n,k})).$$

As we mentioned earlier, the matrix $A$ cannot be diagonal. Also, its condition number grows as $O(2^{2n})$. Consider now the decomposition

$$V_n = V_0 \oplus W_0 \oplus \cdots \oplus W_{n-1},$$

and the corresponding wavelet basis. The space $V_0$ has dimension one and contains the constant functions. We now switch to a one-index notation such that the sets

$$\{1,\ \Psi_{j,l} \mid 0 \le j < n,\ 0 \le l < 2^j\} \quad\text{and}\quad \{\Psi_k \mid 0 \le k < 2^n\}$$

coincide. Define the vectors $\hat b$ and $\hat x$ such that

$$\hat b_i = (f, \tilde\Psi_i) \quad\text{and}\quad u = \sum_{i=0}^{2^n-1} \hat x_i\,\Psi_i.$$

We know that there exist matrices $T$ and $\tilde T$ such that

$$\hat b = \tilde T b \quad\text{and}\quad x = T\hat x.$$
The matrix $\tilde T$ corresponds to the fast wavelet transform decomposition with filters $\tilde H$ and $\tilde G$, and $T$ corresponds to reconstruction with filters $H$ and $G$. The complexity of each matrix-vector multiplication is $O(N)$, $N = 2^n$. In the wavelet basis the system becomes

$$\hat A\hat x = \hat b \quad\text{with}\quad \hat A = \tilde T A T \quad\text{and}\quad \hat A_{k,i} = ((\Psi_i, \tilde\Psi_k)).$$

Since $\hat A$ is diagonal, it can be trivially inverted and the solution is then given by

$$x = T\hat A^{-1}\tilde T b.$$

This means that one has to calculate the wavelet decomposition of the right hand side, divide each coefficient by its corresponding diagonal element, and reconstruct to find the solution. The complexity is $O(N)$.

The constant basis function of $V_0$ has a zero as diagonal element and its coefficient is thus undetermined. Note that this leads to an inconsistency if the integral of $f$ does not vanish.
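The three steps above (decompose the right hand side, divide by the diagonal, reconstruct) can be sketched for the simplest case, the Haar-derived basis of the Example section below. In that basis the periodic operator scaling functions are hat functions and the operator wavelets are half-height hats on the odd nodes of each level. This is an illustration under our own assumptions, not the authors' code. The function name `solve_periodic_hb` is hypothetical. The sketch works with unit-height hats, so the diagonal entry of a wavelet at level $m-1$ becomes $2^{m+1}$ instead of $2^{m-1}$. The load vector is computed with Simpson's rule on each half of a hat. The undetermined constant in $V_0$ is fixed by setting $u(0) = 0$.

```python
import math


def solve_periodic_hb(f, n):
    """Periodic -u'' = f on [0, 1) in the Haar-derived operator wavelet basis (sketch).

    Returns nodal values u(i / 2**n), i = 0 .. 2**n - 1. The right hand side must have
    zero mean; the undetermined constant (the V_0 coefficient) is fixed by u(0) = 0.
    """
    N = 2 ** n
    h = 1.0 / N
    # Loads against unit-height periodic hats at the finest level (Simpson on each half).
    loads = [h / 3.0 * (f((i * h - h / 2) % 1.0) + f(i * h) + f((i * h + h / 2) % 1.0))
             for i in range(N)]
    increments = []  # increments[n - m]: nodal increments at the odd nodes of level m
    for m in range(n, 0, -1):
        Nm = 2 ** m
        # Wavelets of W_{m-1} live on the odd nodes of level m; dividing the load by the
        # diagonal entry 2**(m+1) of a unit-height hat gives its nodal increment.
        increments.append([loads[k] / 2 ** (m + 1) for k in range(1, Nm, 2)])
        # Decomposition step: loads against the coarser hats (periodic indices).
        loads = [0.5 * loads[(2 * k - 1) % Nm] + loads[2 * k] + 0.5 * loads[(2 * k + 1) % Nm]
                 for k in range(Nm // 2)]
    # loads[0] is now the integral of f, whose diagonal entry is zero (compatibility).
    u = [0.0]  # the V_0 component, fixed by u(0) = 0
    for m in range(1, n + 1):
        Nm = 2 ** m
        new = [0.0] * Nm
        new[0::2] = u
        for j, k in enumerate(range(1, Nm, 2)):
            new[k] = 0.5 * (new[k - 1] + new[(k + 1) % Nm]) + increments[n - m][j]
        u = new
    return u


# Usage: f = 4 pi^2 sin(2 pi x) has zero mean and exact solution sin(2 pi x).
u = solve_periodic_hb(lambda x: 4 * math.pi ** 2 * math.sin(2 * math.pi * x), 7)
```

Both loops touch each node a bounded number of times, so the work is $O(N)$. Piecewise linear Galerkin solutions of $-u'' = f$ are exact at the nodes in one dimension, so the returned values agree with the exact solution (normalized by $u(0) = 0$) up to quadrature and rounding errors.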

Boundary conditions 


Our general idea to deal with boundary conditions is to let the operator wavelets satisfy the homogeneous boundary conditions and to let the component in the $V_0$ space satisfy the imposed boundary conditions. This requires the use of special boundary wavelets as described in [19]. With only a slight change of basis one can then incorporate Dirichlet, Neumann, mixed and periodic boundary conditions. The details of this construction go beyond the scope of this paper; we will describe the construction in some specific cases.


Example 

In this section we shall take a look at a simple example, namely the basis we get by starting from the Haar multiresolution analysis, where

$$\phi = \chi_{[0,1]} \quad\text{and}\quad \psi(x) = \phi(2x) - \phi(2x-1).$$

Define the hat function as

$$\Lambda = \chi_{[0,1]} * \chi_{[0,1]}, \quad\text{such that}\quad \Phi = \Lambda \quad\text{and}\quad \Psi(x) = \tfrac{1}{2}\Lambda(2x).$$

The original wavelets are orthogonal and, as a consequence, the basis functions and dual functions coincide.
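As a worked instance of the general formulas for $H$ and $G$, assume the Haar filters are normalized as in $\hat\phi(\omega) = \hat\phi(\omega/2)h(\omega/2)$ and $\hat\psi(\omega) = \hat\phi(\omega/2)g(\omega/2)$. This is our reading of the normalization used in the text. It gives $h(\omega) = (1+e^{-i\omega})/2$ and $g(\omega) = (1-e^{-i\omega})/2$. Then

$$H(\omega) = \Big(\frac{1+e^{-i\omega}}{2}\Big)^2 = \tfrac14 + \tfrac12 e^{-i\omega} + \tfrac14 e^{-2i\omega}, \qquad G(\omega) = \frac{1}{2(1-e^{-i\omega})}\cdot\frac{1-e^{-i\omega}}{2} = \tfrac14.$$

In the spatial domain these are the familiar hat-function relations

$$\Phi(x) = \tfrac12\Phi(2x) + \Phi(2x-1) + \tfrac12\Phi(2x-2), \qquad \Psi(x) = \tfrac12\Phi(2x).$$

The determinant of the previous section reduces to $\Delta(\omega) = e^{-i\omega}/4$, which never vanishes.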

The operator scaling functions can represent linear functions, which means they satisfy the Strang-Fix condition with $M = 2$, and the convergence is of order $h^2$. One can prove that higher order wavelets with more vanishing moments ($M$) will in general not yield faster convergence, because the solution $u$ is not smooth enough. The underlying reason is that the solution $u$ belongs only to a Sobolev space of limited smoothness. One can get faster convergence only by imposing extra regularity conditions on the right hand
side. So, in a way, this basis seems to be the most natural one to work with. Note that these piecewise linear basis functions are local solutions of the homogeneous equation, so the operator scaling functions and wavelets are L-splines. This basis also coincides with Yserentant's hierarchical basis.

Figure 1: Basis for Dirichlet problem.

Figure 2: Basis for Neumann problem.

Figure 1 shows the basis functions in the case of Dirichlet boundary conditions and $n = 3$. The left part shows the bases for the spaces $V_0$ up to $V_3$, while the right part shows the bases for $W_0$ up to $W_2$, which provide the diagonalization. The coefficients of the two functions in the $V_0$ space are determined by the boundary conditions. The fast wavelet transform differs from the periodic algorithm here in the sense that different coefficients are used for the wavelets at the boundary. Note the "half hat" functions here. The basis in the case of the Neumann problem is shown in Figure 2. The boundary conditions are handled by the two functions in the $V_1$ space. Again the coefficient of the constant is undetermined. The algorithm leads to an inconsistency if the integral of $f$ is not equal to $u'(0) - u'(1)$. Note that in both cases the operator wavelets satisfy the homogeneous boundary conditions.

MORE GENERAL CONSTANT COEFFICIENT OPERATORS 


The polyharmonic operator 


The polyharmonic equation is defined as

$$(-1)^m u^{(2m)} = f,$$

and the square root operator is now $V = D^m$. The operator scaling function $\Phi$ is now obtained by convolving the original scaling function $\phi$ $m$ times with the box function, and the operator wavelet $\Psi$ is the $m$-fold antiderivative of the original wavelet $\psi$. In order to get a compactly supported wavelet, the original wavelet now needs to have at least $m$ vanishing moments, a property which can be satisfied by all known wavelet families. The construction and algorithm are then completely similar to the case of the Laplace operator.

Figure 3: The refinement relation for the piecewise exponentials.


The Helmholtz operator

The general definition of the one dimensional Helmholtz operator is

$$L = -D^2 + k^2 \quad\text{such that}\quad V = D + k.$$

Here we shall assume that $k = 1$, which can always be obtained by a simple rescaling of $x$. Observe that $V = D + I = e^{-x}De^{x}$ and thus $V^{-1} = e^{-x}D^{-1}e^{x}$. One easily verifies that applying $V^{-1}$ to a wavelet will not necessarily yield a compactly supported function, since $e^{x}\psi_{j,l}$ in general does not have a vanishing integral. Therefore we let $\Psi_{j,l} = V^{-1}e^{-x}\psi_{j,l} = e^{-x}D^{-1}\psi_{j,l}$. If $\psi_{j,l}$ has a vanishing integral, then $\Psi_{j,l}$ is compactly supported.

In order to diagonalize the stiffness matrix, the original wavelets now need to be orthogonal with respect to a weighted inner product with weight function $e^{-2x}$, because

$$((\Psi_{j,l}, \Psi_{j',l'})) = \int_{-\infty}^{\infty} e^{-2x}\,\psi_{j,l}(x)\,\psi_{j',l'}(x)\,dx.$$

Finding such wavelets is a hard problem in general. Inspired by the Haar basis, we construct a solution where the orthogonality of the wavelets on each level immediately follows from their disjoint supports, by letting $\operatorname{supp}\psi_{j,l} = [2^{-j}l, 2^{-j}(l+1)]$. To get orthogonality between different levels, we need $V_j$ to be orthogonal to $W_{j'}$ for $j' \ge j$, or

$$\int_{-\infty}^{\infty} e^{-2x}\,\phi_{j,l}(x)\,\psi_{j',l'}(x)\,dx = 0 \quad\text{for } j' \ge j.$$

We now let the scaling function coincide with (a multiple of) $e^{2x}$ on the support of the finer scale wavelets,

$$\phi_{j,l} = c_{j,l}\,e^{2x}\chi_{j,l},$$

where $\chi_{j,l}$ is the indicator function of the interval $[2^{-j}l, 2^{-j}(l+1)]$ and the constant $c_{j,l}$ is chosen such that the integral of the scaling functions is a constant. As in the Haar case we choose the wavelets as

$$\psi_{j,l} = \phi_{j+1,2l} - \phi_{j+1,2l+1},$$

so that they have a vanishing integral. The orthogonality between levels now follows from the fact that the scaling functions coincide with a multiple of $e^{2x}$ on the support of the finer scale wavelets, and from the vanishing integral of the wavelets:

$$\int_{-\infty}^{+\infty} e^{-2x}\,\phi_{j,l}(x)\,\psi_{j',l'}(x)\,dx = c_{j,l}\int_{-\infty}^{+\infty}\psi_{j',l'}(x)\,dx = 0 \quad\text{for } j' \ge j.$$

One can see that the operator wavelets are now piecewise hyperbolic functions (piecewise combinations of $e^{x}$ and $e^{-x}$). The operator scaling functions are chosen as

$$\Phi_{j,l} = e^{-x}D^{-1}(\phi_{j,l} - \phi_{j,l+1}), \quad\text{so that}\quad \Psi_{j,l} = \Phi_{j+1,2l}.$$

With the right normalization, one gets

$$\Phi_{j,l}(x) = \begin{cases} \dfrac{\sinh(x - l2^{-j})}{\sinh(2^{-j})} & \text{for } x \in [l2^{-j}, (l+1)2^{-j}], \\[2mm] \dfrac{\sinh((l+2)2^{-j} - x)}{\sinh(2^{-j})} & \text{for } x \in [(l+1)2^{-j}, (l+2)2^{-j}], \\[2mm] 0 & \text{elsewhere.} \end{cases}$$


The operator scaling functions on one level are translates of each other, but those on different levels are no longer dilates of each other. They are supported on exactly the same sets as the ones in Figure 1, and they look roughly similar. The operator scaling functions satisfy a refinement relation

$$\Phi_{j,l} = \sum_{k=0}^{2} H^j_k\,\Phi_{j+1,2l+k}$$

with

$$H^j_0 = H^j_2 = \sinh(2^{-j-1})/\sinh(2^{-j}) \quad\text{and}\quad H^j_1 = 1.$$

Figure 3 shows the refinement relation for the scaling functions. The 3 finer scale functions are not dilates of the coarse scale one, but their weighted sum still reproduces it.

In this basis of hyperbolic wavelets the stiffness matrix of the Helmholtz operator is again diagonal, and the algorithm is completely similar to the Laplace case. The only difference in implementation is that the filter coefficients $H^j_k$ used in the fast wavelet transform now depend on the level.
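The level-dependent filter can be checked directly against the closed form of $\Phi_{j,l}$. The sketch below evaluates both sides of the refinement relation, so one can confirm that $H^j_0 = H^j_2 = \sinh(2^{-j-1})/\sinh(2^{-j})$ and $H^j_1 = 1$ reproduce $\Phi_{j,l}$. It is plain Python, and the function names are ours (hypothetical).

```python
import math


def helmholtz_scaling(j, l, x):
    """Closed-form operator scaling function Phi_{j,l}(x) for V = D + I (k = 1)."""
    s = 2.0 ** (-j)
    left, peak, right = l * s, (l + 1) * s, (l + 2) * s
    if left <= x <= peak:
        return math.sinh(x - left) / math.sinh(s)
    if peak < x <= right:
        return math.sinh(right - x) / math.sinh(s)
    return 0.0


def helmholtz_filter(j):
    """Level-dependent refinement coefficients (H_0^j, H_1^j, H_2^j)."""
    h0 = math.sinh(2.0 ** (-j - 1)) / math.sinh(2.0 ** (-j))
    return (h0, 1.0, h0)


def refined(j, l, x):
    """Right-hand side of the refinement relation: sum_k H_k^j Phi_{j+1,2l+k}(x)."""
    coeffs = helmholtz_filter(j)
    return sum(c * helmholtz_scaling(j + 1, 2 * l + k, x) for k, c in enumerate(coeffs))


# Both sides agree at any point, e.g. on the left half of the support of Phi_{2,1}:
x = 0.3
difference = helmholtz_scaling(2, 1, x) - refined(2, 1, x)  # zero up to rounding
```

Since $\sinh(2s) = 2\sinh(s)\cosh(s)$, the outer coefficient equals $1/(2\cosh(2^{-j-1}))$. It tends to the hat-function value $1/2$ of the Laplace case as $j$ grows, consistent with the limit $\Phi_{j,0}(2^{-j}x) \to \Lambda(x)$ stated below.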

Note that these functions again are L-splines and, in a way, are the most natural to work with. Also note that

$$\lim_{j\to\infty}\Phi_{j,0}(2^{-j}x) = \Lambda(x).$$

Despite the fact that the Strang-Fix conditions are not satisfied, one can prove that the convergence is still of order $h^2$.

So we can conclude that, like the Fourier transform, a wavelet transform can diagonalize constant coefficient operators. The resulting algorithm is a little faster ($O(N)$ instead of $O(N \log N)$). This gain in speed is a consequence of the subsampling of the coarser levels in the wavelet transform (the ones that correspond to the low frequency components of the solution), which is not present in the Fourier transform. Also, boundary conditions are taken care of more easily than in the Fourier case.


VARIABLE COEFFICIENTS 


Naturally, the next question is how to use wavelets for variable coefficient operators. The 
underlying reason why wavelets can diagonalize constant coefficient operators is their locality in the 
frequency domain. We want to understand if we can exploit the localization in space to diagonalize 
variable coefficient operators. The answer is (perhaps quite surprisingly) yes and this really justifies 
the use of wavelets for differential equations. No other technique (to our knowledge) has been able 
to accomplish this. 

We take a closer look at the following operator

$$L = -D\,p^2(x)\,D,$$

where $p$ is sufficiently smooth and positive. The square root is now $V = pD$ and $V^{-1} = D^{-1}(1/p)$. The rest of the analysis is very similar to the case of the Helmholtz operator. Applying $V^{-1}$ directly to a wavelet does not yield a compactly supported function. We therefore take $\Psi_{j,l} = V^{-1}p\,\psi_{j,l} = D^{-1}\psi_{j,l}$, which implies that the wavelets need to be (bi)orthogonal with respect to a weighted inner product with $p^2$ as weight function. We use the same trick as for the Helmholtz equation to construct such functions. This means that we let the scaling functions $\phi_{j,l}$ coincide with a multiple of $1/p^2$ on the dyadic interval $[2^{-j}l, 2^{-j}(l+1)]$ and normalize them such that they have a constant integral. We then take the wavelets $\psi_{j,l}$ to be equal to $\phi_{j+1,2l} - \phi_{j+1,2l+1}$, so that they have a vanishing integral and the operator wavelets are compactly supported. The operator wavelets are now piecewise functions that locally look like $AP + B$, where $P$ is the antiderivative of $1/p^2$, and they are again L-splines. Their support also coincides with the support of the functions of Figure 1, and since $p$ is smooth they will converge to hat functions as the level goes to infinity. The operator wavelets are neither dilates nor translates of one function, since their behavior locally depends on $p$. This is not a problem, because they still generate a multiresolution analysis and satisfy refinement relations. The coefficients in the fast wavelet transform are now different everywhere, and they depend in a very simple way on the Haar wavelet transform of $1/p^2$. The entries of the diagonal stiffness matrix can also be calculated from the wavelet transform of $1/p^2$. The algorithm is completely similar to the previous cases and is of order $N$. Boundary conditions are as easy to handle as in the case of the Laplace operator. Note that the operator scaling functions do not satisfy the Strang-Fix conditions. It is, however, again possible to prove that the method has a convergence of order $h^2$. As mentioned earlier, higher convergence orders cannot be obtained in general.
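The last claims can be made concrete. Assume (our choice of the constant) that the scaling functions are normalized by $\int\phi_{j,l}\,dx = 1$, and write $m_{j,l} = \int_{2^{-j}l}^{2^{-j}(l+1)} p^{-2}(x)\,dx$, so that $\phi_{j,l} = p^{-2}\chi_{j,l}/m_{j,l}$. The numbers $m_{j,l}$ are, up to normalization, the Haar scaling coefficients of $1/p^2$, and they obey $m_{j,l} = m_{j+1,2l} + m_{j+1,2l+1}$. Since $V\Psi_{j,l} = p\,\psi_{j,l}$, a short computation (a sketch under this normalization) gives

$$((\Psi_{j,l}, \Psi_{j',l'})) = \int p^2\,\psi_{j,l}\,\psi_{j',l'}\,dx = \delta_{j,j'}\,\delta_{l,l'}\Big(\frac{1}{m_{j+1,2l}} + \frac{1}{m_{j+1,2l+1}}\Big).$$

With this normalization, $\Psi_{j,l} = D^{-1}\psi_{j,l}$ is the unit-height L-spline hat. It rises like $(P(x) - P(2^{-j}l))/m_{j+1,2l}$ on the left half of its support and returns to zero on the right half.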

NUMERICAL EXAMPLE 


We solve the equation

$$-D\,e^{x^2}D\,u(x) = e^{x^2}\big(\sin(x)(3x^2 - 2) + \cos(x)(2x - 2x^3)\big)/x^3, \quad\text{with } u(0) = 1 \text{ and } u(1) = \sin(1),$$

such that the exact solution is given by $u(x) = \sin(x)/x$. The $L_\infty$ error of the numerically computed solution as a function of the number of levels $l$ is shown in the following table.

  l    L∞ error
  1    1.22e-02
  2    3.37e-03
  3    8.66e-04
  4    2.18e-04
  5    5.45e-05
  6    1.36e-05
  7    3.41e-06
  8    8.52e-07
  9    2.13e-07

Each time the number of levels is increased, the error is divided almost exactly by a factor of 4, which agrees with the $O(h^2)$ convergence.
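The following sketch applies the variable coefficient construction to this test problem, with $p^2 = e^{x^2}$, so that $P(x) = \int_0^x e^{-t^2}\,dt = \tfrac{\sqrt{\pi}}{2}\operatorname{erf}(x)$. It is our own minimal reconstruction, not the authors' implementation. It works directly with unit-height L-spline hats (functions of the form $AP + B$ between nodes) and computes the loads with a 3-point Gauss rule. It relies on the fact that these hats are orthogonal across levels in the energy inner product $\int p^2 u'v'\,dx$. The function names are hypothetical. The text does not give the quadrature or the exact level indexing of the table, so the sketch is not meant to reproduce the tabulated numbers, only the $O(h^2)$ behaviour.

```python
import math

# 3-point Gauss-Legendre rule on [-1, 1] as (node, weight) pairs.
GAUSS3 = ((-math.sqrt(0.6), 5.0 / 9.0), (0.0, 8.0 / 9.0), (math.sqrt(0.6), 5.0 / 9.0))


def solve_lspline(f, P, n, ua, ub):
    """-(a u')' = f on [0, 1], u(0) = ua, u(1) = ub, with a = p**2 and P' = 1/a.

    Returns the 2**n + 1 fine nodes, P at those nodes and the nodal values of the
    Galerkin solution in the hierarchical L-spline basis.
    """
    N = 2 ** n
    xs = [i / N for i in range(N + 1)]
    Pn = [P(x) for x in xs]
    # Loads (f, hat_i) against the unit-height fine-level L-spline hats.
    loads = [0.0] * (N + 1)
    for i in range(N):
        c, r = 0.5 * (xs[i] + xs[i + 1]), 0.5 * (xs[i + 1] - xs[i])
        for t, w in GAUSS3:
            x = c + r * t
            rise = (P(x) - Pn[i]) / (Pn[i + 1] - Pn[i])  # hat of node i+1 on this cell
            loads[i + 1] += w * r * f(x) * rise
            loads[i] += w * r * f(x) * (1.0 - rise)      # hat of node i on this cell
    # Decomposition: the new (odd) nodes of each level carry the operator wavelets.
    incr = {}
    for m in range(n, 0, -1):
        Nm, Q = 2 ** m, Pn[:: 2 ** (n - m)]  # Q = P at the level-m nodes
        for k in range(1, Nm, 2):
            diag = 1.0 / (Q[k] - Q[k - 1]) + 1.0 / (Q[k + 1] - Q[k])  # energy of hat_k
            incr[(m, k)] = loads[k] / diag
        coarse = [0.0] * (Nm // 2 + 1)
        for k in range(1, Nm // 2):
            # L-spline refinement: coarse hat = wl*hat_{2k-1} + hat_{2k} + wr*hat_{2k+1}
            wl = (Q[2 * k - 1] - Q[2 * k - 2]) / (Q[2 * k] - Q[2 * k - 2])
            wr = (Q[2 * k + 2] - Q[2 * k + 1]) / (Q[2 * k + 2] - Q[2 * k])
            coarse[k] = wl * loads[2 * k - 1] + loads[2 * k] + wr * loads[2 * k + 1]
        loads = coarse
    # Reconstruction, starting from the V_0 part fixed by the boundary values.
    u = [ua, ub]
    for m in range(1, n + 1):
        Nm, Q = 2 ** m, Pn[:: 2 ** (n - m)]
        new = [0.0] * (Nm + 1)
        new[0::2] = u
        for k in range(1, Nm, 2):
            theta = (Q[k] - Q[k - 1]) / (Q[k + 1] - Q[k - 1])  # L-spline interpolation weight
            new[k] = new[k - 1] + theta * (new[k + 1] - new[k - 1]) + incr[(m, k)]
        u = new
    return xs, Pn, u


def evaluate(xs, Pn, u, P, x):
    """Value at x in [0, 1] of the L-spline with nodal values u."""
    N = len(xs) - 1
    i = min(int(x * N), N - 1)
    t = (P(x) - Pn[i]) / (Pn[i + 1] - Pn[i])
    return u[i] + t * (u[i + 1] - u[i])


# The test problem of the text: p^2 = exp(x^2), exact solution sin(x)/x.
P_ex = lambda x: 0.5 * math.sqrt(math.pi) * math.erf(x)  # antiderivative of exp(-x^2)
f_ex = lambda x: math.exp(x * x) * (math.sin(x) * (3 * x * x - 2)
                                    + math.cos(x) * (2 * x - 2 * x ** 3)) / x ** 3
exact = lambda x: math.sin(x) / x if x else 1.0


def max_midpoint_error(n):
    xs, Pn, u = solve_lspline(f_ex, P_ex, n, 1.0, math.sin(1.0))
    mids = [(xs[i] + xs[i + 1]) / 2 for i in range(len(xs) - 1)]
    return max(abs(evaluate(xs, Pn, u, P_ex, x) - exact(x)) for x in mids)
```

In this sketch the nodal values are exact up to quadrature and rounding errors, because the L-spline Galerkin solution interpolates the exact solution at the nodes. The $O(h^2)$ error is therefore the interpolation error between nodes. Adding a level reduces the maximal midpoint error by roughly a factor of 4, as in the table.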


CONCLUSION 

In this paper we showed how wavelets can be adapted to be useful in the solution of differential equations. Like the Fourier transform, wavelets can diagonalize constant coefficient operators, and the resulting algorithm is slightly faster. The main result, however, is that even non-constant coefficient operators can be diagonalized with the right choice of basis, which yields a direct $O(N)$ algorithm that is much faster than more classical iterative methods.

This technique can also be applied to the solution of implicit time stepping discretizations of equations of the form $du/dt = Lu + f$, even when $L$ is non-linear. Future research includes the study of non-self-adjoint operators, where a splitting $L = V\tilde V$ is needed, and the study of the possible generalization of these ideas to partial differential equations.
SUMMARY

This study is devoted to a comparative analysis of three "Adaptive ZOOM" (ZOom Overlapping Multi-level) methods based on similar concepts of hierarchical multigrid local refinement: L.D.C. (Local Defect Correction), F.A.C. (Fast Adaptive Composite), and F.I.C. (Flux Interface Correction), the last of which we proposed recently. These methods are tested on two examples of a two-dimensional elliptic problem. For V-cycle procedures, we compare the asymptotic evolution of the global error evaluated in discrete norms, the corresponding local errors, and the convergence rates of these algorithms.

INTRODUCTION

The need for local resolution in physical models occurs frequently in practice. Special local features of the operator coefficients, source terms, and boundary conditions can demand resolution in restricted regions of the domain that is much finer than the required global resolution. Multigrid methods with local mesh refinement provide one way to achieve efficient local resolution by solving problems on various locally nested
grids, and by using these grids as a basis for fast solution and correction on the global basic grid of the calculation domain. Different techniques have been proposed in the literature, such as the pioneering works [1,2,3,4,5].

Accordingly, the concept of "Computational Adaptive Zoom", within a "Graphical and Computational Architecture", has been introduced in the field of numerical simulation in order to take the best advantage of the new capabilities of high performance computer architectures [6]. It can be viewed as the generation, automatic (i.e. adaptive) or not, of multilevel hierarchical local nested zoom grids (ZG) overlapping the global basic grid (BG). These grids may move over the entire computational domain $\Omega$ during the solution phase. This concept is intended to allow both local refinement and global correction of the basic grid solution by a successive transfer of information between the connected grids (BG) and (ZG). It is therefore well adapted to a graphical vision of Zoom in terms of the creation of local graphical windows where they are needed in the problem (strong gradients, discontinuities, singularities, ...), but in an active sense, i.e., the basic grid solution is modified and improved as the computation is performed. This has led us to create an original engineering software package called "AQUILON", which is still in development [6].

In addition, this strategy offers other advantages. The goal is to combine the best features of both multigrid techniques and domain decomposition methods (in the case of overlapping grids) to accelerate convergence and to be well suited to implementation on parallel computers, thus reducing the elapsed time. Another advantage is the possibility of solving different differential problems on the grids (BG) and (ZG), which allows us to optimize both the physical and the numerical model. This can be particularly interesting for the approach of solving problems by "imbedding inside fictitious domains" associated with appropriate "control terms" for expressing the boundary conditions, as proposed in [6]. It is also possible to adopt different kinds of discretization on each grid. Thereby, the multigrid zoom methods share with domain decomposition techniques the opportunity of obtaining precise solutions by combining solutions to problems posed on physical subdomains or, more generally, by combining solutions to appropriately constructed continuous and discrete boundary value sub-problems.

From the numerical point of view, the strategy adopted enables us to work only on structured and uniform meshes for each grid separately, on which a moderate number of degrees of freedom is required. On each grid, a simple and "inexpensive" discretization is performed, leading to the same simple sparsity pattern for the matrices (e.g. 2D block-tridiagonal). We aim to avoid solving problems on unstructured or nonuniform composite meshes, which tends to introduce inaccuracies in the discretization and slowness in the solvers, and is surely more expensive in terms of implementation, data structure storage and CPU time. Our choice is expected to give a relatively good quality/cost ratio for many problems of moderate complexity.

MULTIGRID ZOOM ALGORITHMS

Different ZOOM algorithms will be examined and compared. We consider first the L.D.C. (Local Defect Correction) algorithm proposed by Hackbusch [1]; for its restriction operator we choose a 2D bilinear interpolation operator of the "full weighting control volume" type. The second one belongs to the class of F.A.C. (Fast Adaptive Composite Grid) methods of McCormick [5], for which the analogy with the B.E.P.S. method [4] can be noticed. We use here the "delayed correction" version of F.A.C. Only the third one, the F.I.C. (Flux Interface Correction) algorithm that we proposed more recently [7], will be briefly described hereafter.

All these multigrid zoom algorithms are based on the same general principle: a successive transfer of information level by level, leading to the global correction of the initial discrete solution on each grid, and thus on the global basic grid (BG). The multilevel implementation is recursive, as in the usual multigrid techniques (V-cycles, W-cycles, etc.) [1,3]. The resolution on each grid may be performed "exactly" or by an inexact solve (e.g. a few iterations of a smoothing procedure).

Notations and Definitions

Consider the following second order non-linear elliptic boundary value problem defined on $\Omega$, a bounded open domain in $\mathbb{R}^d$, for $d = 2$ or $3$:

$$(\mathcal{P})\quad \begin{cases} L(u) \equiv \operatorname{div}\varphi(u) + G(u) = f(x), & x \in \Omega, \\ \text{well-posed boundary conditions on } \Gamma = \partial\Omega, & \text{symbolically denoted by (BC)}. \end{cases} \qquad (1)$$

Equation (1), $L(u) = f$, is thus expressed by splitting the nonlinear operator $L(u)$ into the divergence part, where $\varphi(u)$ has the physical meaning of the flux density of the solution $u = u(x)$, and the nonconservative part $G = G(u)$. The relation between the solution $u$ and the flux $\varphi$ can take the general vector form $\varphi(u) = F(u)$ in many systems of conservation laws, but applications will concern an advection-diffusion equation or a Navier-Stokes problem. For the experiments here, $(\mathcal{P})$ is a diffusion problem and we have $\varphi(u) = -a\operatorname{grad} u$.

In order not to have too many formal requirements and restrictions, we assume explicitly only that equation (1) has at least one isolated solution $u^*$ in the space $L^2(\Omega)$. All other assumptions are implicitly contained in the following considerations.

The basic notations will be those classically used in the multigrid framework [1]. We denote by $\ell$ the current index of the grid level ($0 \le \ell \le \ell^*$): $\ell = 0$ is the level of the global basic grid (BG), which discretizes the entire calculation domain $\Omega$, and $\ell = \ell^*$ is the level of the most nested and finest zoom grid (ZG). Each grid of level $\ell$ can be characterized by:

- the open domain $\Omega_\ell$
- the boundary $\Gamma_\ell = \partial\Omega_\ell$, on which the unit outward normal vector $n_\ell$ can be defined
- the closure $\bar\Omega_\ell = \Omega_\ell \cup \Gamma_\ell$
- the mesh size $h_\ell$


Each grid of level $\ell$ is divided into a set of control volumes $V_x$ associated with the nodes $x \in \bar\Omega_\ell$. We denote by $\Gamma_{\ell,\ell+1}$ the interface between two successive grids of levels $\ell$ and $\ell+1$, and we have $\Omega_{\ell+1} \subset \Omega_\ell$ for all $\ell$. The successive mesh sizes will be taken as $h_{\ell+1} = h_\ell/2^p$, $p \in \mathbb{N}$.

notations will also be used : n and n . 

The transfer operators between the grids $\ell$ and $\ell+1$ will be denoted by $R^{\ell}_{\ell+1}$ for the restriction operator and by $P^{\ell+1}_{\ell}$ for the prolongation operator. For all three algorithms, we have chosen $P^{\ell+1}_{\ell}$ as

$$P^{\ell+1}_{\ell}:\ \Gamma_{\ell,\ell+1} \cap \bar\Omega_\ell \longrightarrow \Gamma_{\ell+1} \setminus (\Gamma_{\ell+1} \cap \Gamma),$$

which is a one-dimensional linear interpolation operator defined on the interface of the grids $\ell$ and $\ell+1$. Each value $u_{\ell+1}(y)$ at a node $y \in \Gamma_{\ell,\ell+1}$ on the interface is obtained by linear interpolation of the values $u_\ell$ at the two neighbouring nodes $x$ and $x'$ belonging to $\Gamma_{\ell,\ell+1} \cap \bar\Omega_\ell$, and thus satisfies

$$u_{\ell+1}(y) = u_\ell(x) \quad\text{if } y = x.$$
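The interface prolongation can be sketched for one straight interface segment. The sketch assumes that the coarse interface nodes are equally spaced and that each coarse interval contains $2^p$ fine intervals, as with $h_{\ell+1} = h_\ell/2^p$. The function name and the list-based data layout are our own illustrative choices.

```python
def prolong_interface(coarse_vals, p):
    """Linear interpolation of u_l along one interface segment (sketch of P_l^{l+1}).

    coarse_vals: values of u_l at consecutive, equally spaced coarse interface nodes.
    Returns u_{l+1} at the fine interface nodes (2**p fine intervals per coarse one);
    fine nodes that coincide with coarse nodes receive the coarse value unchanged.
    """
    r = 2 ** p
    fine = []
    for a, b in zip(coarse_vals[:-1], coarse_vals[1:]):
        for s in range(r):
            t = s / r                       # relative position inside the coarse interval
            fine.append((1.0 - t) * a + t * b)
    fine.append(coarse_vals[-1])
    return fine


print(prolong_interface([0.0, 1.0, 4.0], 1))  # [0.0, 0.5, 1.0, 2.5, 4.0]
```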
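If we denote by $L_\ell u_\ell = f_\ell$ the discretized equation of (1) on the grid of level $\ell$, we can define the following discrete boundary value problems on $\bar\Omega_\ell$:

$$(\mathcal{P}_0)\quad \begin{cases} L_0 u_0 = f_0 & \text{in } \Omega_0, \\ \text{(BC)} & \text{on } \Gamma_0, \end{cases}$$

$$(\mathcal{P}_\ell)\quad \begin{cases} L_\ell u_\ell = f_\ell & \text{in } \Omega_\ell, \\ \text{(BC)} & \text{on } \Gamma_\ell \cap \Gamma, \\ u_\ell = P^{\ell}_{\ell-1}u_{\ell-1} & \text{on } \Gamma_\ell \setminus (\Gamma_\ell \cap \Gamma), \end{cases} \qquad 1 \le \ell \le \ell^*. \qquad (2)$$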


We denote by $u_\ell^k$ the discrete solution obtained on the grid $\bar\Omega_\ell$ at the $k$-th iteration of the zoom algorithm, and by $e_\ell^k = u_\ell^* - u_\ell^k$ the associated discrete error, where $u_\ell^*$ is the natural restriction of the exact solution $u^*$ of problem $(\mathcal{P})$ to $\bar\Omega_\ell$.

For $0 \le \ell \le \ell^*$, $\gamma(\ell)$ will represent the number of iterations of the zoom algorithm on the grid level $\ell$ within one whole cycle: if $\gamma(\ell) = 1$ (respectively $\gamma(\ell) = 2$) for all $0 < \ell < \ell^*$, then V-cycles (respectively W-cycles) are described. We have $\gamma(\ell^*) = 1$, and $\gamma(0)$ is the total number of cycles performed from the basic grid (BG) in order to obtain the so-called convergence of the zoom algorithm. When $\ell^* = 1$ (i.e. for a two-grid algorithm), only V-cycles are of course carried out. The term "No Zoom" will be used for the resolution by "an exact solve" of problem $(\mathcal{P})$ on the basic grid (BG) of mesh size $h_0$ ($\ell^* = 0$, $k = 0$). The term "Zoom" will be used to indicate that some iterations of a multilevel zoom algorithm have been performed: $\ell^* \neq 0$, $1 \le k \le \gamma(0)$.


Description of the Multilevel F.I.C. Algorithm

The main idea of the two-grid FIC method for levels $\ell$ and $\ell+1$ is to apply, at each node $x$ of the flux correction interface (defined below) on the grid level $\ell$, the local "flux residual" correction due to the whole patch of level $\ell+1$. This is obtained through the expression of the local flux balance (i.e. integration) of eq. (1) over the volume $V_x^* = V_x \cap \bar\Omega_{\ell+1}$, between the grid level $\ell$ on one hand and the grid level $\ell+1$ on the other hand.

Because of the consistency of the conservative discretization of the fluxes by the finite volume method, which must be respected on each grid, the outward normal fluxes of $\varphi(u)$ through the interface between two neighbouring control volumes are opposite. Giving more importance to the local "flux residual" leads us to consider, for the correction step on the grid level $\ell$, the local flux of the defect only at each node of a boundary zone $I^{\ell}_{\ell+1}$, defined as the "flux correction interface". We can choose for $I^{\ell}_{\ell+1}$ either the stripe $\Delta^+ = \{\cup V_x^*,\ x \in \Gamma_{\ell,\ell+1} \cap \bar\Omega_\ell\}$, or $\Delta^-$ (see the figure further on) if we want the boundary $\partial\Delta$ to correspond to interfaces between control volumes on the grid level $\ell$: we will have $V_x^* = V_x$ in the latter case. On the grid level $\ell$ we define, for all $x \in I_\ell = I^{\ell}_{\ell+1} \cap \Omega_\ell$, $\Gamma_x = \partial V_x \cap \partial\Omega_{\ell+1} \neq \emptyset$, or, respectively, $\Gamma_x = \partial V_x \cap \partial\Delta^-$ (with $\mathrm{mes}(\Gamma_x) \neq 0$).
We then propose the following restriction operator on the outward normal flux through the "interface boundary" $\gamma_{\ell,\ell+1} = \{\cup\Gamma_x,\ x \in I_\ell\}$:

$$\big(R^{\ell}_{\ell+1}(\varphi_{\ell+1}(u)\cdot n_{\ell+1})\big)(x) = \frac{1}{\mathrm{mes}(\Gamma_x)}\int_{\Gamma_x}\varphi_{\ell+1}(u)\cdot n_{\ell+1}\,d\gamma, \quad \forall x \in I_\ell. \qquad (3)$$
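On a uniform Cartesian grid, each coarse control-volume face $\Gamma_x$ on the interface is made of $2^p$ fine faces of equal length. Approximating the integral in (3) with the midpoint rule on each fine face then turns the restriction into a plain average of fine-grid face fluxes. The sketch below assumes exactly that: uniform faces, with the fluxes given at fine-face midpoints. The name `restrict_flux` is hypothetical, and the quadrature choice is ours, not taken from the text.

```python
def restrict_flux(fine_face_fluxes, p):
    """Average of fine-grid normal fluxes over each coarse interface face (cf. eq. (3)).

    fine_face_fluxes: phi_{l+1}(u).n_{l+1} at the midpoints of consecutive fine faces
    along the interface; each coarse face Gamma_x consists of 2**p of them. With the
    midpoint rule on equal fine faces, (1/mes(Gamma_x)) * integral becomes a mean.
    """
    r = 2 ** p
    if len(fine_face_fluxes) % r:
        raise ValueError("the number of fine faces must be a multiple of 2**p")
    return [sum(fine_face_fluxes[i:i + r]) / r for i in range(0, len(fine_face_fluxes), r)]


# A flux varying linearly along the interface, q(s) = 3 + 2 s, sampled on 8 fine faces
# of length 1/4 (p = 2): the coarse averages equal q at the coarse-face midpoints.
q = [3.0 + 2.0 * (k + 0.5) / 4 for k in range(8)]
print(restrict_flux(q, 2))  # [4.0, 6.0]
```

The midpoint rule is exact for fluxes that vary linearly along a face, which is why the averages in the usage example match the flux at the coarse-face centres.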


We can then define, as in [7], the local "flux residual" correction at each node $x \in I_\ell = I^{\ell}_{\ell+1} \cap \Omega_\ell$ on the grid level $\ell$ by:


r ( «p)M = I R ll ( Vl<"> J W> ' W 


(x) (4) 


U+l KH, l+l KUJ 1 

The control parameter $\varepsilon(\ell,x)$, which has the dimension of a length, has already been encountered in order to assign Neumann and Robin (or Fourier) boundary conditions in the context of "imbedding inside a fictitious domain" [6]. Its expression is given by

$$\varepsilon(\ell,x) = \frac{\mathrm{mes}(V_x)}{\mathrm{mes}(\Gamma_x)}. \qquad (5)$$

A complete calculation, not yet published, gives a complex expression for $\omega(\ell,x)(u)$, which in the case under consideration, $G = 0$, is the following:
( 6 )


f 

cp(u).n dy 

. 

/ 

% i 

(p(u).n dy 


r 

r * 

£+i 

„ - 

r 

r 


CD(£,x)(u) “ 1 + 


r 

«% % 
cp(u).n dy 



* 

<p(u).n dy 


r 

x * 

e+ 1 


r 

X 


We can then generate the successive iterates by the multilevel FIC algorithm, implemented in a recursive way:

Initialization: compute $u_0^0$
    $u_0^0$ is obtained by solving problem $(\mathcal{P}_0)$
Iterations: compute the successive iterates $u_0^k$
    for k = 1 to $\gamma(0)$ do FIC(0)
Composite re-actualization: provide $u_0^{\gamma(0)}$ on (BG) by assigning
    for $\ell = \ell^* - 1$ down to 0: $u_\ell^{\gamma(0)}(x) = u_{\ell+1}^{\gamma(0)}(x)$, $\forall x \in \bar\Omega_\ell \cap \bar\Omega_{\ell+1}$

Procedure FIC($\ell$)
    If $\ell = \ell^*$ Then solve problem $(\mathcal{P}_{\ell^*})$ Else
    begin
      * 1st step - resolution on the grid level $\ell+1$:
          - solve problem $(\mathcal{P}_{\ell+1})$, providing $u_{\ell+1}$
          - for k = 1 to $\gamma(\ell+1)$ do FIC($\ell+1$)
      * 2nd step - correction on the grid level $\ell$:
          - solve problem $(\mathcal{P}_\ell)$ with the right hand side $f_\ell + \chi_\ell\, r_\ell(\varphi)$,
            where $r_\ell(\varphi)$ is computed by equations (3)(4)(5)(6)
            and $\chi_\ell$ is the characteristic function of $I_\ell$
    end

Remarks:

1) In any case, in order to avoid the explicit calculation of $\omega(\ell,x)(u)$ by eq. (6), an economical solution is to use an approximate correction for FIC. In that version, called FIC($\omega$), only the flux integrals on the interface $\Gamma$ will be evaluated by quadrature formulae (Simpson), and an average weighting factor $\omega(\ell)$ will be determined in a semi-empirical way for each grid level. Besides, it can play the role of an average relaxation parameter for the iterative zoom algorithm when $\omega(\ell) = \omega$.
2) In terms of domain decomposition, the two-grid FIC for levels $\ell$ and $\ell+1$ can be regarded as a fully overlapping iterative algorithm that splits the whole composite problem into two Dirichlet/Neumann boundary value sub-problems:

- the problem on the grid level $\ell+1$, with a Dirichlet boundary condition on the interface $\Gamma_{\ell,\ell+1}$ given by (2),
- the problem on the level $\ell$, with a condition of relaxed transmission of the flux on the interface through (4), which demands flux continuity at convergence. That condition can be considered as a Neumann boundary condition on the interface by the "fictitious domain" technique of [6].

General Comments on the Three Algorithms

i) The two-grid FAC method for levels $\ell$ and $\ell+1$ can be regarded as an iterative procedure to solve "exactly" the discrete composite problem coming from an adequate discretization of problem $(\mathcal{P})$ on the composite grid defined by the association of the grids $\bar\Omega_\ell$ and $\bar\Omega_{\ell+1}$. The principle is therefore to apply a multigrid algorithm between the composite grid and each of the two grids $\bar\Omega_\ell$ and $\bar\Omega_{\ell+1}$ [5,4]. There is thus a correction phase on both grid levels $\ell$ and $\ell+1$ with respect to the discretization on the composite grid. In that sense, FAC can be viewed as an "exact" solver for the composite problem. Because the composite grid stencils agree with the coarse and fine grid stencils outside and inside the refinement region, respectively, and because the correction equations are solved exactly, the composite grid residual is nonzero only at the interface.

ii) - Due to the attention needed for the nonuniform discretization of the 

problem on the interface zone of the composite grid, FAC method can prove to 

be a little difficult to implement in a more than two grids version. 

iii) By contrast, the LDC and FIC methods, which are easier to implement in the multilevel case, are only approximate solvers: they do not use a composite grid and they neglect the fine grid residual correction. The former performs a local correction of the solution defect inside the refinement region, whereas the latter involves a local flux residual correction across the interface.

iv) Both the FAC and FIC methods provide corrections by balancing the fluxes computed on the coarse and fine grids across the interface. They therefore benefit most from a conservative discretization of the equations, for example by a finite volume technique.
NUMERICAL APPLICATIONS


In this context, we compare three types of multigrid zoom algorithms on two examples of a linear elliptic problem (P), which present, respectively, a discontinuity of the operator coefficients for (P1) [8] and a singularity of the exact solution for (P2) [1]:

(P)
$$L(u) = -\operatorname{div}\big(a(x)\,\nabla u\big) + \alpha(x)\,u = f(x) \quad \text{in } \Omega = \,]0,1[\,\times\,]0,1[,$$
with well-posed boundary conditions on Γ = ∂Ω, symbolically denoted (BC), where a > 0 and α ≥ 0 belong to L^∞(Ω) and f ∈ L²(Ω).

These problems have already been tested successfully with the FIC method in [7]. Problem (P1) is heterogeneous and is defined by f ≡ 0, α = 0, a = 100 inside a disk of radius 0.1 and a = 1 outside (Fig. 1a). A solution computed on a very fine basic mesh (512²) is used as the reference solution u*. Problem (P2) is defined by f ≡ 0, α = 0, a = 1 (Fig. 1b); the exact solution is u* = ln(r) with r = √(x² + y²).


Numerical Implementation and Procedures 


The discretization on each grid, independent of the geometry of the problem, is carried out conservatively by a finite volume method on a uniform Cartesian mesh. The classical five-point scheme is used, providing second-order accuracy. The linear systems, which are block-tridiagonal and symmetric positive definite, are solved by a fast and efficient solver: a preconditioned conjugate gradient (PCG) method (CG-SSOR) vectorized by a red-black numbering of the unknowns. The two-grid results are obtained with an "exact" solve on each grid. The multilevel LDC or FIC results (ℓ > 1) are given for an "inexact" solve on each grid (including Fig. 5b), i.e., a fixed number itcg of PCG iterations on each grid, with:

itcg = 2 for h_0 = 1/8,   itcg = 4 for h_0 = 1/16,   itcg = 8 for h_0 = 1/32.
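The "inexact" solve above amounts to stopping a preconditioned conjugate gradient iteration after itcg steps. The sketch below shows SSOR-preconditioned CG with a fixed iteration budget. It is only an illustration under stated assumptions: dense numpy matrices, lexicographic ordering (the red-black vectorization of the paper is not reproduced), and a hypothetical SSOR parameter `omega`; the names `ssor_apply`, `pcg_ssor` and `laplacian_2d` are ours.

```python
import numpy as np


def ssor_apply(A, r, omega=1.0):
    """Return z = M^{-1} r for the SSOR preconditioner of a symmetric matrix A.

    M = omega/(2-omega) * (D/omega + L) (D/omega)^{-1} (D/omega + L^T),
    where D is the diagonal and L the strictly lower part of A.
    """
    Dw = np.diag(np.diag(A)) / omega
    lower = Dw + np.tril(A, -1)
    upper = Dw + np.triu(A, 1)
    y = np.linalg.solve(lower, r)          # forward sweep
    z = np.linalg.solve(upper, Dw @ y)     # backward sweep
    return (2.0 - omega) / omega * z


def pcg_ssor(A, b, x0, itcg, omega=1.0, tol=1e-12):
    """Run at most itcg iterations of SSOR-preconditioned conjugate gradients."""
    x = x0.astype(float).copy()
    r = b - A @ x
    z = ssor_apply(A, r, omega)
    p = z.copy()
    rz = r @ z
    bnorm = np.linalg.norm(b)
    for _ in range(itcg):
        if np.linalg.norm(r) <= tol * bnorm:   # already converged: stop early
            break
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        z = ssor_apply(A, r, omega)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x


def laplacian_2d(n):
    """Five-point Laplacian on an n x n interior grid of the unit square (dense)."""
    h = 1.0 / (n + 1)
    T = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    I = np.eye(n)
    return (np.kron(I, T) + np.kron(T, I)) / h**2
```

For example, `pcg_ssor(A, b, x0, itcg=2)` mimics the budget used for h_0 = 1/8, while a large `itcg` with the tolerance test behaves like an "exact" solve.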

The results are analyzed with different norms (L^∞, L², L-energy) of the discrete error evaluated on the global basic grid (BG, ℓ = 0). We study the asymptotic evolution of the relative error norms ||e^0||/||u*|| (no zoom) and ||e^{γ(0)}||/||u*|| (after γ(0) zoom iterations), with e^k = u_0^k − u*, which allows us to estimate an asymptotic average rate τ:
* for ℓ = 1, as a function of h_1 or p (for a fixed h_0):
$$\tau = \left(\frac{\|e^0\|}{\|e^{\gamma(0)}\|_{(p=m)}}\right)^{1/m}, \qquad m = \max\{p \in \mathbb{N}^*\}.$$
Here m = 3 and γ(0) = 2; see Tab. 1, Tab. 2, Fig. 2a, Fig. 3 and Fig. 4.

* for ℓ > 1, as a function of ℓ (for a fixed h_0 and p = 1):
$$\tau = \left(\frac{\|e^0\|}{\|e^{\gamma(0)}\|_{(\ell=m)}}\right)^{1/m}, \qquad m = \max\{\ell \in \mathbb{N}^*\}.$$
Here m = 3 and γ(0) = 10; see Tab. 3 and Fig. 2b.
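As a check on the definition of τ, the short function below recomputes entries of the τ rows in Tab. 1 and Tab. 2 from the tabulated error norms. The function name `average_rate` is ours. The same expression serves both cases, since only the meaning of m (largest p or largest ℓ) changes.

```python
def average_rate(e0_norm, em_norm, m):
    """tau = (||e^0|| / ||e^{gamma(0)}||_{(p=m)}) ** (1/m)."""
    return (e0_norm / em_norm) ** (1.0 / m)


# Tab. 1, h_0 = 1/8, FIC (omega = 0.55): no zoom 0.434E-1, zoom with p = 3: 0.589E-2
print(round(average_rate(0.434e-1, 0.589e-2, 3), 2))  # prints 1.95
```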


The convergence rates of LDC, FAC and FIC have also been compared (Tab. 4):

* for FAC: we study the variations of the Euclidean norm of the composite residual |r^k(u)|_2 for k = 1 to γ(0) (Fig. 5a), and a convergence rate ρ is calculated as a geometric mean:
$$\rho = \left(\frac{|r^{\gamma(0)}(u)|_2}{|r^{1}(u)|_2}\right)^{1/(\gamma(0)-1)}$$

* for LDC or FIC: we study the variations of the quantities δ^k = |u_0^k − u_0^{k−1}|_2 for k = 1 to γ(0) (Fig. 5b), and a convergence rate ρ is then estimated by:
$$\rho = \left(\frac{\delta^{\gamma(0)}}{\delta^{1}}\right)^{1/(\gamma(0)-1)}$$
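Both estimates of ρ are the geometric mean of the γ(0) − 1 successive reduction factors, because the product of those factors telescopes to the ratio of the last and first entries. A minimal sketch (the function name is ours) that takes the recorded history, either |r^k(u)|_2 or δ^k:

```python
import math


def geometric_mean_rate(history):
    """rho = (q^{gamma(0)} / q^1) ** (1 / (gamma(0) - 1)).

    history = [q^1, ..., q^{gamma(0)}], where q^k is |r^k(u)|_2 for FAC
    or delta^k = |u_0^k - u_0^{k-1}|_2 for LDC and FIC.
    """
    if len(history) < 2:
        raise ValueError("need at least two iterations")
    return (history[-1] / history[0]) ** (1.0 / (len(history) - 1))
```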




Comparative Numerical Results 

1) Comparing a no-zoom method with a zoom method, we observe that the error decreases globally: for each increment of p or ℓ, it is divided by an average factor τ between 1.5 and 3.5 (Tab. 1, Tab. 2, Tab. 3). For problem (P2), the decrease is monotonic, and the variation of the error as a function of p (for ℓ = 1) closely resembles its variation as a function of ℓ (for p = 1) (see Fig. 2a and Fig. 2b). The influence of the position and dimensions of the local grids (ZG) becomes negligible as h_0 decreases [7]. Because the discretization is carried out on a Cartesian mesh independently of the geometry of the heterogeneity, the error for problem (P1) does not decrease monotonically, as already noticed in [7,8].

2) In many cases, the error obtained with zoom is smaller than the error computed without zoom on a single basic grid whose mesh size equals that of the finest local grid. In particular, Fig. 4 shows that the local discrete error |e^k| computed point by point along the diagonal of the domain for (P2) by a two-grid FIC method (h_0 = 1/16, h_1 = h_0/2, k = 2) is globally smaller than the error obtained with no zoom at the corresponding nodes of BG (ℓ = 0, h_0 = 1/32). The zoom results are more accurate inside the refinement region and approach the no-zoom results far from the interface. Similar remarks apply to the discrete error norms in the other tables and figures. However, the error is not reduced beyond a threshold value consistent with the order of accuracy of the discretization schemes on the different grids (cf. the multigrid defect correction method using Richardson extrapolation [1]).

3) The two-grid FAC and FIC methods yield errors of the same order of magnitude for both problems. These results are far better than those of LDC for problem (P1), where flux conservation plays an important role. Conversely, LDC yields results as good as the others for problem (P2), and sometimes better. However, since LDC does not deal with the interface fluxes but only works on the solution inside the open refinement region, it can become inefficient (τ = 1) if the refinement region does not contain enough coarse nodes on which to perform the local defect correction (Tab. 1, Tab. 2, Tab. 3).

4) The results with the version FIC(ω) for 0.1 < ω < 0.5 are nearly identical to those obtained with ω* = ω^{(ℓ,k)}(u) computed by (6) (Fig. 3). This could justify the interest of the approximate version FIC(ω), particularly as a preconditioner of the discrete problem, as suggested in [4].

5) Because of its exact character, the FAC method yields by far the best convergence rate, with a mean value of 0.16, nearly independent of both h_0 and h_1 (Fig. 5a and Tab. 4). We obtain a mean convergence rate of 0.42 for FIC(ω = 0.2), just slightly better than LDC with 0.50. These convergence rates are not very sensitive to variations of h_0 and ℓ (Fig. 5b and Tab. 4). However, the FIC rates show a noticeable tendency to improve as the number of grid levels (ℓ) increases (see Tab. 4).
CONCLUSION


Despite its non-exact character, FIC provides results as good as FAC with respect to the discrete errors for both tested problems. In particular, FAC and FIC proved to be better than LDC for problems where local flux conservation plays a major role.

FAC yields very good convergence rates (ρ ≈ 0.16), better than LDC (ρ ≈ 0.50) or FIC (ρ ≈ 0.42), but its multilevel implementation remains more difficult. However, FIC is likely to be very attractive as a preconditioning technique for the discrete problem, especially in the approximate version FIC(ω), where the factor ω becomes a relaxation parameter. We are currently testing such a procedure for Navier-Stokes problems.
Tab. 1. Problem (P1), two-grid zoom, γ(0) = 2: discrete L² norm of the error (h_1 = h_0/2^p)

                          Zoom x1 = 0.375, x2 = 0.625        Zoom x1 = 0.25, x2 = 0.75
h_0     No zoom    p      LDC       FAC       FIC ω=0.55     LDC       FAC       FIC ω=0.40
1/8     0.434E-1   1      0.434E-1  0.493E-1  0.210E-1       0.434E-1  0.732E-2  0.639E-2
                   2      0.434E-1  0.790E-1  0.122E-1       0.434E-1  0.133E-1  0.106E-1
                   3      0.434E-1  0.330E-1  0.589E-2       0.434E-1  0.289E-2  0.251E-2
                   τ      1.00      1.10      1.95           1.00      2.47      2.59
1/16    0.689E-2   1      0.689E-2  0.167E-1  0.119E-1       0.739E-2  0.986E-2  0.106E-1
                   2      0.689E-2  0.374E-2  0.394E-2       0.297E-2  0.123E-2  0.161E-2
                   3      0.689E-2  0.255E-2  0.339E-2       0.214E-2  0.673E-3  0.592E-3
                   τ      1.00      1.39      1.27           1.48      2.17      2.27
1/32    0.982E-2   1      0.136E-1  0.342E-2  0.538E-2       0.244E-2  0.160E-2  0.201E-2
                   2      0.140E-1  0.295E-2  0.412E-2       0.174E-2  0.428E-3  0.392E-3
                   3      0.134E-1  0.326E-2  0.331E-2       0.153E-2  0.233E-3  0.262E-3
                   τ      0.90      1.44      1.44           1.86      3.48      3.35
1/64    0.192E-2
1/128   0.648E-3
1/256   0.193E-3

Tab. 2. Problem (P2), two-grid zoom, γ(0) = 2: discrete L-energy norm of the error (h_1 = h_0/2^p)

                          Zoom x1 = 0, x2 = 0.25             Zoom x1 = 0, x2 = 0.5
h_0     No zoom    p      LDC       FAC       FIC ω=0.20     LDC       FAC       FIC ω=0.20
1/8     0.342E-1   1      0.342E-1  0.152E-1  0.168E-1       0.120E-1  0.138E-1  0.139E-1
                   2      0.342E-1  0.817E-2  0.110E-1       0.441E-2  0.511E-2  0.553E-2
                   3      0.342E-1  0.622E-2  0.990E-2       0.221E-2  0.250E-2  0.346E-2
                   τ      1.00      1.77      1.51           2.49      2.39      2.15
1/16    0.206E-1   1      0.723E-2  0.829E-2  0.837E-2       0.704E-2  0.818E-2  0.819E-2
                   2      0.270E-2  0.308E-2  0.337E-2       0.232E-2  0.282E-2  0.285E-2
                   3      0.140E-2  0.152E-2  0.215E-2       0.627E-3  0.975E-3  0.107E-2
                   τ      2.45      2.38      2.12           3.20      2.76      2.68
1/32    0.133E-1   1      0.454E-2  0.527E-2  0.527E-2       0.453E-2  0.526E-2  0.526E-2
                   2      0.150E-2  0.182E-2  0.184E-2       0.149E-2  0.180E-2  0.180E-2
                   3      0.407E-3  0.628E-3  0.699E-3       0.387E-3  0.570E-3  0.577E-3
                   τ      3.20      2.77      2.67           3.25      2.86      2.85
1/64    0.888E-2
1/128   0.607E-2
1/256   0.420E-2
1/512   0.294E-2
Summary. A new multi-grid (or adaptive) pseudospectral element method has been developed for the solution of incompressible flow in the primitive variable formulation. The desired features of the proposed method include (1) the ability to treat complex geometry; (2) high resolution adapted to the regions of interest; (3) minimal working space; and (4) efficiency in a multiprocessor environment.

The approach for flow problems, with or without complex geometry, is first to divide the computational domain into a number of fine-grid and coarse-grid subdomains with overlapping areas. Next, the Schwarz alternating procedure (SAP) is applied to exchange data among subdomains, and a coarse-grid correction removes the high-frequency error that arises when data are interpolated from the fine-grid subdomain to the coarse-grid subdomain. The strategy behind the coarse-grid correction is to build into this process the divergence operator of the velocity field, which is intrinsically linked to the pressure equation. Each subdomain problem can be solved efficiently by the direct (or iterative) eigenfunction expansion technique with the least storage requirement, i.e., O(N³) in 3-D and O(N²) in 2-D.

Numerical results for both driven cavity flow and jet flow are presented to demonstrate the versatility of the proposed method.
1 Introduction

Owing to advances in numerical techniques, numerous CFD algorithms have been developed to tackle difficult flow problems. Nevertheless, numerical algorithms should have the following desired features: (1) the ability to deal with a variety of geometrical shapes; (2) an arbitrary layout of dense grid points in the regions of interest; (3) minimal working space; and (4) low computational time. Developing a pseudospectral element method with these features is our major concern.

One of the improvements in the area of feature (2) is the multi-grid technique, which has long been advocated for finite-difference methods [1, 2]. On the same computational domain, a sequence of uniform grids is employed to accelerate the convergence of iterative methods. The approach rests on "standard coarsening," i.e., doubling the mesh size in each direction from one grid to the next coarser grid, and on transferring the smoothed residual to the next coarser grid (restriction). The problem is solved on the coarse grid (low-frequency domain), and the coarse-grid correction is transferred back (prolongation) to the fine grid (high-frequency domain) to obtain rapid convergence. The techniques developed so far, even with adaptive schemes, are still largely limited to relatively simple geometries with uniform grids in Cartesian coordinates, and are less developed for non-uniform grids in curvilinear coordinates.
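The restriction / coarse solve / prolongation sequence described above can be illustrated with a two-grid cycle for the 1-D model problem −u'' = f with zero Dirichlet values. This is a textbook sketch, not the pseudospectral element method of this paper: weighted Jacobi smoothing, full-weighting restriction, linear interpolation and an exact coarse solve are all our illustrative choices, and the function names are ours.

```python
import numpy as np


def poisson_matrix(n):
    """Second-order matrix for -u'' on n interior points of (0, 1)."""
    h = 1.0 / (n + 1)
    return (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2


def smooth(A, u, f, sweeps, w=2.0 / 3.0):
    """Weighted Jacobi sweeps: damp the high-frequency error components."""
    d = np.diag(A)
    for _ in range(sweeps):
        u = u + w * (f - A @ u) / d
    return u


def restrict(r):
    """Full weighting from 2*nc + 1 fine points to nc coarse points."""
    return 0.25 * r[0:-2:2] + 0.5 * r[1:-1:2] + 0.25 * r[2::2]


def prolong(ec):
    """Linear interpolation from nc coarse points to 2*nc + 1 fine points."""
    padded = np.concatenate(([0.0], ec, [0.0]))   # zero Dirichlet values
    ef = np.zeros(2 * ec.size + 1)
    ef[1::2] = ec
    ef[0::2] = 0.5 * (padded[:-1] + padded[1:])
    return ef


def two_grid_cycle(A, Ac, u, f, nu=2):
    u = smooth(A, u, f, nu)                         # pre-smoothing on the fine grid
    ec = np.linalg.solve(Ac, restrict(f - A @ u))   # coarse-grid correction equation
    u = u + prolong(ec)                             # transfer the correction back
    return smooth(A, u, f, nu)                      # post-smoothing
```

With 31 fine and 15 coarse interior points, a few cycles reduce the residual by several orders of magnitude, because the smoother and the coarse grid handle complementary frequency ranges.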

The SAP iterative scheme has been applied successfully with the pseudospectral element method to relatively simple configurations in which the overlapped grids are located at the same places [3, 4]. We refer to such cases as the single-grid SAP method, because no error is introduced during data interpolation. Under some circumstances, however, the overlapped grids cannot be made to coincide. This happens when the geometry is complex, for example with a layout of mixed grid types (Cartesian, "O" or "C"), or when adaptive fine grids are needed for high resolution in one area and coarse grids suffice elsewhere. The main objective of this paper is a careful treatment of such overlapped grids by the SAP iterative scheme, in order to eliminate the high-frequency error due to data interpolation. A further question is how the continuity equation is satisfied in the overlapping area (including the interfaces between the fine-grid and coarse-grid subdomains) when the incompressible Navier-Stokes equations are solved in primitive variable form. This reflects the fact that the boundary conditions for the pressure should be linked to the incompressibility constraint. We also address the extension of the single-grid SAP to a multi-grid SAP domain decomposition method that overcomes these difficulties.

The remainder of the paper consists of four sections. Section 2 derives a primitive variable form of the Navier-Stokes equations. Section 3 discusses the multi-grid SAP domain decomposition method. Section 4 presents numerical results for the proposed 2-D problems, and Section 5 provides the conclusions.


2 Primitive Variable Formulation 


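In tensor notation, the time-dependent Navier-Stokes equations in dimensionless form can be described as

$$\frac{\partial u_i}{\partial t} + u_j\frac{\partial u_i}{\partial x_j} = -\frac{\partial p}{\partial x_i} + \frac{1}{Re}\frac{\partial^2 u_i}{\partial x_j^2} \qquad (1a)$$

$$\frac{\partial u_i}{\partial x_i} = 0 \qquad (1b)$$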
Here u_i is the velocity component, p is the pressure and Re is the Reynolds number.

The Navier-Stokes equations are solved by Chorin's [5] splitting technique. According to this scheme, the equations of motion read

$$\frac{\partial u_i}{\partial t} + \frac{\partial p}{\partial x_i} = F_i \qquad (2)$$

where $F_i = -u_j\,\partial u_i/\partial x_j + (1/Re)\,\partial^2 u_i/\partial x_j^2$.

The first step is to split the velocity into a sum of predicted and corrected values. The predicted velocity $\tilde{u}_i$ is determined by time integration of the momentum equations without the pressure term:

$$\tilde{u}_i^{n+1} = u_i^n + \Delta t\,F_i^n \qquad (3)$$

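The second step is to determine the pressure and the corrected velocity fields that satisfy the continuity equation, using the relationships

$$\frac{u_i^{n+1} - \tilde{u}_i^{n+1}}{\Delta t} = -\frac{\partial p}{\partial x_i} \qquad (4a)$$

$$\frac{\partial u_i^{n+1}}{\partial x_i} = 0 \qquad (4b)$$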


Here the superscript n denotes the n-th time step. Note that the size of a stable time step Δt can be increased by using an adaptation of Runge-Kutta techniques [6] for high Reynolds numbers and the Stokes solution for low Reynolds numbers [3], respectively.

An equation for the pressure is obtained by taking the divergence of Eq. (4a). In view of Eq. (4b), it reads

$$\frac{\partial^2 p}{\partial x_j^2} = \frac{1}{\Delta t}\frac{\partial \tilde{u}_i^{n+1}}{\partial x_i} \qquad (5)$$

Note that whenever Eq. (5) is solved, the identity of Eq. (4a) should be used to absorb the given boundary conditions of the velocity components [7].

If p satisfies Eq. (5), then u^{n+1} does indeed satisfy Eq. (4b). Solving the pressure equation, Eq. (5), is the most computationally expensive step, although in Cartesian coordinates it can be solved directly by separation of variables [7]. Eq. (5) has the general form

$$Lp = S \qquad (6)$$

for some linear operator L on some finite-dimensional vector space. The properties of the operator L depend on the methods chosen to represent the fields and their derivatives.
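To make the projection step of Eqs. (4a), (4b) and (5) concrete, the sketch below applies it on a doubly periodic square using Fourier derivatives. This is an assumption chosen for brevity, not the paper's pseudospectral element discretization or its boundary treatment; Nyquist modes are not treated specially, which is harmless for the smooth test fields. The helper names are ours.

```python
import numpy as np


def wavenumbers(n, L):
    """Angular wavenumbers on an n x n periodic grid of side L (x index first)."""
    k = 2 * np.pi * np.fft.fftfreq(n, d=L / n)
    return np.meshgrid(k, k, indexing="ij")


def project(u_star, v_star, dt, L=2 * np.pi):
    """Solve Eq. (5), lap p = div(u*)/dt, then apply Eq. (4a): u = u* - dt grad p."""
    n = u_star.shape[0]
    kx, ky = wavenumbers(n, L)
    uh, vh = np.fft.fft2(u_star), np.fft.fft2(v_star)
    div_h = 1j * kx * uh + 1j * ky * vh
    k2 = kx**2 + ky**2
    k2[0, 0] = 1.0                    # p is defined only up to a constant
    p_h = -div_h / (dt * k2)          # the Laplacian becomes -k^2 in Fourier space
    p_h[0, 0] = 0.0
    u_new = np.real(np.fft.ifft2(uh - dt * 1j * kx * p_h))
    v_new = np.real(np.fft.ifft2(vh - dt * 1j * ky * p_h))
    return u_new, v_new, np.real(np.fft.ifft2(p_h))


def divergence(u, v, L=2 * np.pi):
    kx, ky = wavenumbers(u.shape[0], L)
    return np.real(np.fft.ifft2(1j * kx * np.fft.fft2(u) + 1j * ky * np.fft.fft2(v)))
```

After `project`, the discrete divergence of the corrected field vanishes to round-off, which is the property Eq. (4b) demands.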

Let the pressure p and the source term S in Eq. (6) be expanded in a series of eigenfunctions such that

$$p = EX\,\hat{p}\,EY^T EZ^T \qquad (7a)$$

$$S = EX\,\hat{S}\,EY^T EZ^T \qquad (7b)$$

Then the solution of the three-dimensional pressure equation (5) reduces to the simple algebraic form

$$(\alpha_i + \beta_j + \gamma_k)\,\hat{p}_{i,j,k} = \hat{S}_{i,j,k} \qquad (8)$$
where α_i, β_j and γ_k are the eigenvalues of the discrete derivative matrices of the linear operator L, and EX, EY, EZ are the corresponding eigenvectors. However, depending on the operator L, the eigenvalues may not be real. Without any restriction on the eigenvalues, complex eigenvalues and their associated eigenvectors are permitted, provided that the imaginary part of the pressure gradient vanishes. This holds because only the pressure gradient, not the pressure itself, drives the flow. However, the cost of the matrix multiplications increases by a factor of four if all the calculations of Eqs. (7) are performed in complex arithmetic. Fortunately, the increase does not exceed a factor of two if one takes advantage of (1) the purely real part of the eigenfunctions in the matrix multiplication; (2) the source term S being real; and (3) choosing only the real part of the pressure as the pressure solution. Item (1) is implemented by reordering the eigenfunctions into two parts, real and complex, and similarly for the eigenvalues.
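The eigenfunction expansion of Eqs. (7) and (8) can be sketched in two dimensions, where L p reads Ax P + P Ay^T. The code diagonalizes each 1-D operator once, divides by α_i + β_j, and maps back. As an assumption, Dirichlet second-difference matrices stand in for the pseudospectral derivative matrices; they are chosen so that no α_i + β_j vanishes. Taking the real part mirrors item (3) above. The function names are ours.

```python
import numpy as np


def second_difference(n, h):
    """Dirichlet second-difference matrix, a stand-in for a 1-D derivative operator."""
    return (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
            + np.diag(np.ones(n - 1), -1)) / h**2


def eigen_expansion_solve(Ax, Ay, S):
    """Solve Ax @ P + P @ Ay.T = S, the 2-D analogue of Eqs. (7)-(8)."""
    alpha, EX = np.linalg.eig(Ax)                            # Ax = EX diag(alpha) EX^{-1}
    beta, EY = np.linalg.eig(Ay)                             # Ay = EY diag(beta)  EY^{-1}
    S_hat = np.linalg.solve(EX, S) @ np.linalg.inv(EY).T     # S = EX S_hat EY^T
    P_hat = S_hat / (alpha[:, None] + beta[None, :])         # Eq. (8) in 2-D
    P = EX @ P_hat @ EY.T                                    # Eq. (7a) in 2-D
    return np.real(P)   # keep the real part; here any imaginary part is round-off
```

In 3-D the same idea adds a third factor EZ and divides by α_i + β_j + γ_k. Storage stays proportional to the number of unknowns, plus three small eigenvector matrices.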

The iterative preconditioned method for solving the pressure equation in curvilinear coordinate systems can be found in [8]. Note that if there are N degrees of freedom in each direction, the overall memory required to solve the pressure equation in three dimensions is O(N³). This is the same storage-efficient scaling as for the velocity field.

When the Navier-Stokes equations are solved by the splitting method, two steps account for most of the run time: the predicted velocity and the pressure solution. The bulk of these two steps can be expressed concisely as dot products and matrix multiplications between subsets of arrays. Importantly, no data dependency arises when the programs run on parallel machines.


3 Domain Decomposition with Multi-Grid SAP

Solving the Navier-Stokes equations by the domain decomposition approach consists of first dividing the computational domain into a number of subdomains with overlapping areas, where the grids inside the overlapping area may or may not be located at the same places. Next, the SAP is applied to exchange data among subdomains, i.e., the problem is solved on each subdomain separately and the velocity field is then updated on the overlapped interfaces. The advantages of this approach include (i) less memory access, local rather than global, and (ii) easy treatment of complex geometry.

In addition to the Lagrangian constraint between the pressure and velocity fields, the noncoincident overlapped grids in the overlapping areas among subdomains make the multi-grid technique even harder to apply. However, the idea of "coarse-grid correction" is still effective for reducing the high-frequency error in the interpolated residual of the fine-grid subdomain. The coarse-grid correction process adopts the idea proposed by Thompson and Ferziger [9], modified as

$$\nabla_c\cdot u_c - \nabla_c\cdot\big(I_f^c u_f\big) = I_f^c\big(r_f - \nabla_f\cdot u_f\big) \qquad (9)$$

Here ∇_c· represents the divergence operator on the coarse-grid subdomain, I_f^c is an interpolation operator from the fine-grid subdomain f to the coarse-grid subdomain c, and u is the velocity component. r_f is simply the prescribed divergence of the velocity field, which should be set to zero. The left-hand side of Eq. (9) is the difference between the coarse-grid operator acting on the coarse-grid solution and the coarse-grid operator acting on the interpolated fine-grid solution (which is held fixed). The right-hand side of Eq. (9) is the interpolated residual of the fine-grid subdomain. Clearly, once the solution on the fine-grid subdomain has been found, the residual is zero (the pressure equation is satisfied exactly), which also implies

$$u_c = I_f^c u_f \qquad (10)$$

When the residual is nonzero, Eq. (9) acts as a forcing term for the coarse-grid correction, which transfers the correction of the velocity field back to the fine-grid subdomain, i.e.,

$$u_f^{new} = u_f^{old} + I_c^f\big(u_c - I_f^c u_f^{old}\big) \qquad (11)$$

This is vital for the success of the scheme: changes in the velocity field, rather than the velocity field itself, are transferred back to the fine-grid subdomain. Notice that when the overlapped grids in the overlapping areas coincide, the interpolation operator I_f^c automatically becomes the identity matrix.
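The key point of Eq. (11), transferring the change rather than the field, can be sketched in one dimension. As a hypothetical choice, piecewise-linear interpolation (`np.interp`) stands in for both I_f^c and I_c^f; the paper's operators are the spectral interpolations of Section 3.1. When Eq. (10) holds, the fine solution is left untouched. Copying the coarse field directly onto the fine grid would instead wipe out fine-scale detail.

```python
import numpy as np


def f_to_c(x_f, u_f, x_c):
    """Hypothetical stand-in for I_f^c: piecewise-linear interpolation."""
    return np.interp(x_c, x_f, u_f)


def c_to_f(x_c, u_c, x_f):
    """Hypothetical stand-in for I_c^f."""
    return np.interp(x_f, x_c, u_c)


def correct_fine(x_f, u_f_old, x_c, u_c):
    """Eq. (11): u_f_new = u_f_old + I_c^f (u_c - I_f^c u_f_old)."""
    delta_c = u_c - f_to_c(x_f, u_f_old, x_c)    # change computed on the coarse grid
    return u_f_old + c_to_f(x_c, delta_c, x_f)   # only the change is sent back
```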

The multi-grid SAP iterative solution of the incompressible Navier-Stokes equations in primitive variable form for the driven cavity flow sketched in Fig. 1 is summarized by the following algorithm:

1. First assume u^{n+1} on AB. Usually u^n is a good initial guess.

2. Solve the fine-grid domain II using the boundary conditions derived from the divergence of the velocity field, including on AB; the pressure solution is obtained directly by the eigenfunction expansion method.

3. With the solution u^{n+1} from step 2 interpolated onto domain III ⊂ I, solve the coarse-grid domain I using the same type of boundary conditions, including on CD, and update u^{n+1} on domain III ⊂ II by the coarse-grid correction process.

4. Repeat steps 2 and 3 until the velocity u^{n+1} on AB and CD no longer changes.
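The alternation of steps 1 to 4 can be illustrated on a much simpler model. The sketch below uses −u'' = f on (0, 1), a fine domain II = [0, b] with interface AB at x = b, and a coarse domain I = [a, 1] with interface CD at x = a. Finite differences are used on each domain, and linear interpolation transfers values between the noncoincident grids. It is plain alternating Schwarz and omits the coarse-grid correction of Eqs. (9) to (11); the grid sizes, overlap and names are hypothetical.

```python
import numpy as np


def solve_dirichlet(x, f, ul, ur):
    """Second-order finite differences for -u'' = f on the uniform grid x."""
    n, h = x.size - 2, x[1] - x[0]
    A = (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2
    rhs = f(x[1:-1]).astype(float)
    rhs[0] += ul / h**2
    rhs[-1] += ur / h**2
    u = np.empty_like(x)
    u[0], u[-1] = ul, ur
    u[1:-1] = np.linalg.solve(A, rhs)
    return u


def schwarz(f, a=0.4, b=0.6, n_fine=81, n_coarse=21, tol=1e-10, max_iter=200):
    """Alternating Schwarz between fine domain II = [0, b] and coarse domain I = [a, 1]."""
    x_fine = np.linspace(0.0, b, n_fine)       # domain II, interface AB at x = b
    x_coarse = np.linspace(a, 1.0, n_coarse)   # domain I, interface CD at x = a
    g_ab = 0.0                                 # step 1: initial guess on AB
    for it in range(1, max_iter + 1):
        u_fine = solve_dirichlet(x_fine, f, 0.0, g_ab)                              # step 2
        u_coarse = solve_dirichlet(x_coarse, f, np.interp(a, x_fine, u_fine), 0.0)  # step 3
        g_new = np.interp(b, x_coarse, u_coarse)   # new AB value from the coarse grid
        if abs(g_new - g_ab) < tol:                # step 4: stop when AB stops changing
            return x_fine, u_fine, x_coarse, u_coarse, it
        g_ab = g_new
    raise RuntimeError("Schwarz iteration did not converge")
```

Neither interface point falls on a node of the other grid, so each exchange involves an interpolation error. This is the situation the coarse-grid correction is designed to handle in the full method.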

To guarantee that consistent values of velocity (or pressure gradient) are generated in the overlapping domains III, satisfying Eq. (10), the divergence of the velocity field ∇·u must actually be computed in whichever domain, I or II, is considered [4]. Since u on domains III is not known a priori, the divergence of the velocity field is set to zero only at the first SAP iteration of step 2. With this approach, the continuity equation is satisfied on domain II (including III ⊂ II) and on domain I (excluding III ⊂ I); Eq. (9) shows that it is only satisfied on the fine-grid domain II within the overlap. More specifically, the proposed approach readily resolves the question of how to satisfy the continuity equation along the interfaces between the fine and coarse grid domains: ∇·u on AB is satisfied on the coarse-grid domain I, and ∇·u on CD is satisfied on the fine-grid domain II. However, the error index of the continuity equation on domain III ⊂ I indicates how good the interpolation is (which depends on the layout of the overlapped grids) and whether any steep change in the flow field exists.

Three main issues often arise in the overlapping area between the fine-grid and coarse-grid subdomains: how to (1) implement the interpolation efficiently; (2) represent the predicted velocity adequately; and (3) impose global mass conservation explicitly. Each is addressed separately below.
3.1 Data interpolation

The image (ξ, η), −1 ≤ ξ ≤ 1, −1 ≤ η ≤ 1, of a collocation point (x, y) of the fine-grid subdomain II mapped onto the coarse-grid subdomain I (or vice versa) is determined by first using two-dimensional Lagrange interpolation to locate its position within an element of the coarse-grid subdomain. Each such element contains (M + 1) × (N + 1) collocation points, ξ_i = cos(πi/M) (i = 0, ..., M) and η_j = cos(πj/N) (j = 0, ..., N), such that

$$x = \sum_{m=0}^{M}\sum_{n=0}^{N} a_{mn}\,T_m(\xi)\,T_n(\eta) \qquad (12a)$$

$$y = \sum_{m=0}^{M}\sum_{n=0}^{N} b_{mn}\,T_m(\xi)\,T_n(\eta) \qquad (12b)$$

where T_m denotes the Chebyshev polynomial of order m. The unknown expansion coefficients a_mn and b_mn are easily obtained from the prescribed points (x_ij, y_ij) of the coarse-grid subdomain I through

$$x_{ij} = \sum_{m=0}^{M}\sum_{n=0}^{N} a_{mn}\,T_m(\xi_i)\,T_n(\eta_j) \qquad (13a)$$

$$y_{ij} = \sum_{m=0}^{M}\sum_{n=0}^{N} b_{mn}\,T_m(\xi_i)\,T_n(\eta_j) \qquad (13b)$$

Given a point (x, y) in the physical space of the fine-grid subdomain II, its image (ξ, η) on the coarse-grid subdomain I can be found iteratively by the Newton-Raphson method. Once the one-to-one correspondence between the fine-grid and coarse-grid subdomains has been established, a function φ(x, y) on the fine-grid subdomain is interpolated from the coarse-grid subdomain by

$$\phi(x,y) = \sum_{i=0}^{M}\sum_{j=0}^{N} \phi(\xi_i,\eta_j)\,N_i(\xi)\,N_j(\eta) \qquad (14)$$

where φ(ξ_i, η_j) denotes the function value at the collocation point (ξ_i, η_j) on the coarse-grid subdomain, and N_i(ξ), N_j(η) are the shape functions defining the geometry of the element on the coarse-grid subdomain:

$$N_i(\xi) = \sum_{m=0}^{M} \tilde{T}_{mi}\,T_m(\xi) \qquad (15a)$$

$$N_j(\eta) = \sum_{n=0}^{N} \tilde{T}_{nj}\,T_n(\eta) \qquad (15b)$$

where the matrix [T_m(ξ_i)] and its inverse [\tilde{T}_{mi}] correspond to the Fourier cosine series and its inverse [7]. Note that the shape functions N_i(ξ), N_j(η) satisfy the Kronecker-delta property, i.e., N_i(ξ_m) = δ_im and N_j(η_n) = δ_jn.

Note that the data interpolation requires much less effort if the one-to-one correspondence of the shape functions between subdomains is stored once and for all. The cost of this additional memory is negligible compared with the storage of a single field variable.
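The three ingredients of Section 3.1 can be sketched for one element with numpy's Chebyshev module: mapping coefficients from Eq. (13), a Newton-Raphson search for the image (ξ, η), and the shape-function interpolation of Eqs. (14) and (15). Some simplifications are assumptions here. The element search by Lagrange interpolation is skipped. The inverse Chebyshev Vandermonde matrix replaces the cosine transform; it is mathematically equivalent, not the authors' implementation. The function names are ours.

```python
import numpy as np
from numpy.polynomial import chebyshev as C


def gl_points(M):
    """Collocation points xi_i = cos(pi i / M), i = 0..M."""
    return np.cos(np.pi * np.arange(M + 1) / M)


def mapping_coeffs(X, Y):
    """Invert Eqs. (13a)-(13b): X = Vx a Vy^T with Vx[i, m] = T_m(xi_i)."""
    M, N = X.shape[0] - 1, X.shape[1] - 1
    Vx_inv = np.linalg.inv(C.chebvander(gl_points(M), M))
    Vy_inv = np.linalg.inv(C.chebvander(gl_points(N), N))
    return Vx_inv @ X @ Vy_inv.T, Vx_inv @ Y @ Vy_inv.T, Vx_inv, Vy_inv


def find_image(x0, y0, a, b, guess=(0.0, 0.0), tol=1e-12, max_iter=50):
    """Newton-Raphson for (xi, eta) such that Eqs. (12a)-(12b) give (x0, y0)."""
    xi, eta = guess
    ax, ay = C.chebder(a, axis=0), C.chebder(a, axis=1)
    bx, by = C.chebder(b, axis=0), C.chebder(b, axis=1)
    for _ in range(max_iter):
        F = np.array([C.chebval2d(xi, eta, a) - x0, C.chebval2d(xi, eta, b) - y0])
        J = np.array([[C.chebval2d(xi, eta, ax), C.chebval2d(xi, eta, ay)],
                      [C.chebval2d(xi, eta, bx), C.chebval2d(xi, eta, by)]])
        d = np.linalg.solve(J, F)
        xi, eta = xi - d[0], eta - d[1]
        if np.hypot(*d) < tol:
            return xi, eta
    raise RuntimeError("Newton-Raphson did not converge")


def shape_functions(s, V_inv):
    """Eq. (15): N_i(s) = sum_m T_m(s) V_inv[m, i] (Kronecker delta at the nodes)."""
    return C.chebvander(np.atleast_1d(s), V_inv.shape[0] - 1)[0] @ V_inv


def interpolate(phi, xi, eta, Vx_inv, Vy_inv):
    """Eq. (14): phi(x, y) = sum_ij phi(xi_i, eta_j) N_i(xi) N_j(eta)."""
    return shape_functions(xi, Vx_inv) @ phi @ shape_functions(eta, Vy_inv)
```

In line with the remark above, the vectors returned by `shape_functions` could be computed once per fine-grid point and cached.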
3.2 Predicted velocity

Since the predicted velocity generated from Eq. (3) on the fine-grid subdomain differs slightly, in the overlapping area, from that obtained on the coarse-grid subdomain, it is very important to control the predicted velocity so that the error index, the L2 norm ω = ||u_c − I_f^c u_f||, stays minimal. Numerical experiments suggest that the dynamic relationship

$$u_f = (1 - \omega^a)\,u_f + \omega^a\,I_c^f u_c \qquad (16)$$

gives the best fit. Here the exponent a is chosen as 0.4 for the various problems tested.
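Eq. (16) blends the fine-grid predicted velocity with the interpolated coarse-grid one, weighted by the current error index. A minimal sketch follows. Computing ω as a plain Euclidean norm over the overlap nodes is an assumption, since the normalization is not stated. The clipping of the weight at 1 is a safeguard added here, not part of the text.

```python
import numpy as np


def error_index(u_c, Iu_f):
    """omega = ||u_c - I_f^c u_f||_2 over the overlap nodes (Euclidean norm assumed)."""
    return float(np.linalg.norm(u_c - Iu_f))


def blend_predicted_velocity(u_f, Iu_c, omega, a=0.4):
    """Eq. (16): u_f <- (1 - omega**a) u_f + omega**a I_c^f u_c."""
    w = min(omega**a, 1.0)   # safeguard added here: keep the blend a convex combination
    return (1.0 - w) * u_f + w * Iu_c
```

With ω = 10^-4 and a = 0.4, the coarse-grid contribution has weight 10^-1.6 ≈ 0.025, so the correction weakens automatically as the grids come into agreement.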


3.3 Global flow rate 


For inflow-outflow problems, the coarse-grid velocity field interpolated from the fine-grid subdomain may not exactly satisfy global mass conservation, and a slight adjustment of the velocity field is necessary. A commonly used formula meets this requirement:

$$u^{new} = u\,\frac{\int u_{exact}\cdot dA}{\int u\cdot dA} \qquad (17)$$
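In 2-D, the integrals of Eq. (17) are line integrals across a section, so the correction is a single scalar rescaling of the velocity profile. The sketch below assumes trapezoidal quadrature on the section nodes, which is an illustrative choice, and uses names of our own.

```python
import numpy as np


def flux(u, y):
    """Trapezoidal approximation of the flow rate (integral of u dA) across a 2-D section."""
    return float(np.sum(0.5 * (u[1:] + u[:-1]) * np.diff(y)))


def rescale_to_exact_flux(u, y, q_exact):
    """Eq. (17): scale u so that its flow rate equals q_exact (integral of u_exact dA)."""
    return u * (q_exact / flux(u, y))
```

Because the same quadrature evaluates both fluxes, the rescaled profile carries exactly the prescribed flow rate up to round-off.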


4 Results and Discussion 

For the numerical test of the driven cavity flow problem, the layouts of elements (6 points per element) in the fine-grid and coarse-grid subdomains at Reynolds numbers of 400 and 100 are displayed in Figs. 2a and 2b, respectively. The overlapping area is not shown explicitly in the figures; it corresponds to extending the coarse-grid subdomain by one more element into the fine-grid subdomain. The layout of elements is chosen to resolve the steep changes inside the boundary layers. When data are exchanged through interpolation in the overlapping area, the high-frequency error introduced by the fine-grid subdomain pollutes the results throughout the whole computational domain. This can be verified by checking the error index, the L2 norm ω = ||u_c − I_f^c u_f||, in the overlapping area: ω grows as the solution marches in time and eventually becomes unreasonably large, at which point no solution is obtained. With the multi-grid SAP approach, both cases give ω = O(10^{-4}) instead. The streamline plots give ψ_min = −0.1055 for Re = 100 and ψ_min = −0.1163 for Re = 400, in good agreement with the most accurate results of Ghia [10] (Fig. 3).

For the inflow-outflow jet problem, a nozzle with a smoothly converging channel is designed to produce a high-speed flow. The configuration of the jet flow is plotted in Fig. 4. A jet emanating from the nozzle with an aspect ratio H/D = 144 (tank width to nozzle width) is used to study the turbulent characteristics through direct numerical simulation. With a strong stratification imposed in the vertical direction, the two-dimensional turbulent flow calculation is a good approximation to the three-dimensional case. The calculation is carried out down to the Kolmogorov length scale, where the energy transferred from the large scales is in equilibrium with the energy dissipated at the smallest scale by molecular viscosity. Of course, for direct numerical simulation the Reynolds number must not be so large that the machine can no longer handle the huge number of points required to resolve the different length scales. The computational domain is decomposed into three subdomains: the upstream nozzle, where the inflow develops and accelerates to a high speed; the region immediately downstream of the nozzle exit, where the high-speed jet is discharged into the tank; and the far downstream region (fine grids), where a well-developed turbulent flow can be traced.

Let us first check the error index ω for the inflow-outflow jet problem without the multi-grid SAP technique. An ω of about O(10^{-3}) seems acceptable at Re = 100 initially, but at a Reynolds number of 250 the onset of noise starts to destabilize the downstream flow field and ω increases to O(10^{-2}). This clearly demonstrates the high-frequency pollution originating on the fine-grid subdomain; the noise is completely removed by the multi-grid SAP technique. Fig. 5 depicts the streamlines of the jet flow at Re = 250. During the time evolution of the jet flow, the symmetry of the jet front remains undistorted at the early stage (Fig. 5a) until the phase speed of the vortex shedding (due to flow instability) exceeds that of the jet front. As seen in Fig. 5b, a pair of vortices adjacent to the jet front, persisting throughout the course, represents the extrusion of the jet into the ambient fluid. Once the incoming travelling waves catch up with the jet front, the energy transferred by vortex shedding splits into two parts: one drives the jet front against the ambient viscous resistance, and the other feeds the vertical motion. This energy cascades from its highest level at the nozzle exit (high shedding frequency) to its lowest at the jet front (low shedding frequency). The intensity of the vertical motion behind the jet front gradually increases, as visualized by the splitting streamlines, and these patterns move backward toward the nozzle exit, where a distinct pair of vortices exists. Similar pairs of vortices have also been observed experimentally at high Reynolds number [11].


5 Conclusions 

The Navier-Stokes equations in primitive variable form have been solved by the pseudospectral element method using the multi-grid domain decomposition technique. The computational domain is divided into a number of simple subdomains with overlapping zones; fine grids (fine-grid subdomains) are used where the flow field changes steeply, and coarse grids (coarse-grid subdomains) are used elsewhere. During data exchange among subdomains, the coarse-grid correction technique eliminates the high-frequency error caused by data interpolation from the fine-grid subdomain to the coarse-grid subdomain.

Both the driven cavity and the jet flow results demonstrate the versatility of the proposed multi-grid method.
SUMMARY


We describe a multigrid multiblock method for compressible turbulent flow simulations and present results obtained from calculations on a two-element airfoil. A vertex-based spatial discretization method and explicit multistage Runge-Kutta time-stepping are used. The slow convergence of the single-grid method makes the multigrid method, which yields a speed-up of about a factor of 20, indispensable. The numerical predictions are in good agreement with experimental results. It is shown that the convergence of the multigrid process depends considerably on the ordering of the various loops. When a three-stage Runge-Kutta method is used, the process converges more rapidly if the block loop is put inside the stage loop than if it is situated outside the stage loop. With a five-stage scheme, the process does not converge with the latter loop ordering. Finally, with the block loop inside the stage loop, the process based on the five-stage method is about 60% more efficient than that based on the three-stage method.


INTRODUCTION 


Numerical simulations of turbulent flow in aerodynamic applications are frequently based on the Reynolds-averaged Navier-Stokes equations. One of the relevant problems in aeronautics is the prediction of flow quantities in complicated geometries, such as the multi-element airfoil (see figure 1). The simulation of turbulent flow around such a multi-element airfoil configuration was one of the applications selected for the compressible flow solver developed by our group and NLR as part of the Dutch ISNaS project [1]. For this application the use of a single-block, boundary-conforming, structured grid is impossible, and one may select either an unstructured grid approach or a block-structured grid approach. Although the former technique has been applied successfully by others [2], we selected the block-structured approach in view of the transparent data structure in the coding, the ease of implementation of the turbulence model, and the high flexibility with respect to the use of different physical models in different parts of the computational domain.

Figure 1: Geometry of a two-element airfoil.

In a previous paper [3] it has been shown that for laminar and turbulent flow around a single airfoil 
the introduction of the multiblock structure has no influence on the results, with respect to both the 
steady-state solution and the convergence rate. Furthermore, invoking the Euler equations instead 
of the Navier-Stokes equations in blocks outside the boundary layer appeared to have no significant 
influence on the results. In this paper we describe the application of the multiblock concept to 
the multi-element airfoil. If the Euler equations are used throughout the computational domain, 
a converged steady-state solution is obtained within a reasonable calculation time. However, if the 
Reynolds-averaged Navier-Stokes equations are solved in the boundary layers, the rate of convergence 
is unacceptably low. Therefore, a multigrid technique was implemented in order to accelerate the 
convergence. The resulting gain in calculation time is close to a factor of 20, and the converged 
solution is in good agreement with wind-tunnel measurements. 

In section 2 the numerical technique, which is based on a combination of a finite volume method 
with central spatial differencing and a Runge-Kutta explicit time-stepping method, is described. The 
results, both for inviscid and for viscous simulations, are presented in section 3. Finally, in section 4 
some conclusions are summarized. 


NUMERICAL METHOD 


In this section we describe the numerical method used in the flow solver. The two-dimensional, 
compressible Navier-Stokes equations can be written in integral form as 


$$\frac{d}{dt}\iint_{\Omega} U\,dx\,dy + \oint_{\partial\Omega} \left( F\,dy - G\,dx \right) = 0, \qquad (1)$$


where $U$ represents the vector of dependent variables,

$$U = [\rho, \rho u, \rho v, E]^T, \qquad (2)$$

with $\rho$ the density, $u$ and $v$ the Cartesian velocity components, and $E$ the total energy density. Further, $\Omega$ is an arbitrary part of the two-dimensional space with boundary $\partial\Omega$, and $F$ and $G$ are the Cartesian components of the total flux vector. This flux vector consists of two parts: the non-dissipative or 'convective' part and the dissipative or 'viscous' part, which describes the effects of viscosity and heat conduction and involves first-order spatial derivatives. The Navier-Stokes equations (1) are averaged over a sufficiently large time interval. Due to the nonlinear terms in the convective fluxes, the resulting 'Reynolds-averaged Navier-Stokes' equations involve averages of products of two velocity components. These terms are modeled by a suitable turbulence model. In the present paper the algebraic Baldwin-Lomax turbulence model, in which the unknown terms are modeled by eddy viscosity terms, is adopted [4].

The discretization of the Navier-Stokes equations follows the method of lines, i.e. the spatial discretization is performed first, and the resulting set of ordinary differential equations is subsequently integrated in time until the steady-state solution is approximated. First the computational domain is divided into blocks, and each block is partitioned into quadrilateral cells by a structured, boundary-conforming grid. The variables are stored in the grid points. A finite volume method is used in which the integral form of the Navier-Stokes equations is applied to a control volume $\Omega$, bounded by the dashed lines in figure 2. The convective flux through a boundary of this control volume is approximated using the value of the convective flux vector in the midpoint of the boundary. The latter is calculated by averaging over the two neighboring grid points. The viscous flux vector involves spatial derivatives of the state vector $U$ and is approximated in the corner points of the control volume with the use of Gauss' theorem on a grid cell. The viscous flux is subsequently calculated using the trapezoidal rule. This method is called the vertex-based method.

Figure 2: Control volume in the vertex-based method.
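A minimal sketch of the midpoint-averaging rule for the convective flux is given below, written in Python for a single scalar conserved quantity. It assumes a polygonal control volume traversed counter-clockwise, in which face $j$ separates the central grid point from neighbor $j$; the flux vector at each face midpoint is the average of the two nodal values, and the contribution $F\,\Delta y - G\,\Delta x$ of each face is summed as in (1). The function and argument names are hypothetical, and neither the viscous flux nor the exact dual-cell geometry of figure 2 is modeled.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def convective_flux_balance(
    faces: List[Tuple[Point, Point]],
    f_center: float,
    g_center: float,
    f_neighbors: List[float],
    g_neighbors: List[float],
) -> float:
    """Approximate the contour integral of (F dy - G dx) around one control volume.

    faces[j] = (start, end) of the j-th boundary segment, traversed
    counter-clockwise; it separates the central grid point from neighbor j.
    """
    total = 0.0
    for (start, end), f_nb, g_nb in zip(faces, f_neighbors, g_neighbors):
        dx = end[0] - start[0]
        dy = end[1] - start[1]
        # Flux at the face midpoint: average over the two neighboring grid points.
        f_mid = 0.5 * (f_center + f_nb)
        g_mid = 0.5 * (g_center + g_nb)
        total += f_mid * dy - g_mid * dx
    return total
```

For a uniform flux the balance vanishes for any closed polygon, and for $F = x$, $G = 0$ on the unit square it returns the area, so the sketch reproduces the divergence theorem exactly in these simple cases.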

The method of central differencing leads to a decoupling of odd and even grid points and to oscillations near shock waves. Even in viscous flow calculations, the viscous dissipation is insufficient to damp these instabilities outside shear layers. Therefore, nonlinear artificial dissipation is added to the basic numerical scheme. This artificial dissipation consists of two contributions: fourth-order difference terms, which prevent odd-even decoupling, and second-order difference terms to resolve shock waves. The second-order terms are controlled by a shock sensor, which detects discontinuities in the pressure. In the present flow solver the artificial dissipation in the boundary layers, where the viscous dissipation should be dominant, may be reduced by multiplication with the ratio of the local and free-stream Mach number. The role of the artificial dissipation in relation to the viscous dissipation is discussed in more detail in reference [5].
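The combination of fourth- and second-order difference terms controlled by a pressure sensor can be sketched along a single grid line. The formulas below follow the widely used Jameson-Schmidt-Turkel form; the paper does not state its own formulas, so the sensor, the coefficients `k2` and `k4` and their default values are assumptions, and the optional `scale` array merely stands in for the Mach-number-ratio reduction mentioned above. How $D$ is combined with the convective residual (its sign convention) is left to the caller.

```python
from typing import List, Optional


def pressure_sensor(p: List[float]) -> List[float]:
    """Normalized second difference of the pressure; zero at the two end points."""
    nu = [0.0] * len(p)
    for i in range(1, len(p) - 1):
        nu[i] = abs(p[i + 1] - 2.0 * p[i] + p[i - 1]) / (p[i + 1] + 2.0 * p[i] + p[i - 1])
    return nu


def dissipative_fluxes(w: List[float], p: List[float], k2: float = 0.5,
                       k4: float = 1.0 / 32.0,
                       scale: Optional[List[float]] = None) -> List[float]:
    """Entry i approximates d_{i+1/2}; entries without a full stencil stay zero."""
    n = len(w)
    nu = pressure_sensor(p)
    d = [0.0] * (n - 1)
    for i in range(1, n - 2):
        eps2 = k2 * max(nu[i], nu[i + 1])  # second-order term, switched on by the shock sensor
        eps4 = max(0.0, k4 - eps2)         # fourth-order term, switched off near discontinuities
        second = w[i + 1] - w[i]
        fourth = w[i + 2] - 3.0 * w[i + 1] + 3.0 * w[i] - w[i - 1]
        # Optional reduction factor, e.g. the local/free-stream Mach number ratio.
        s = 1.0 if scale is None else 0.5 * (scale[i] + scale[i + 1])
        d[i] = s * (eps2 * second - eps4 * fourth)
    return d


def artificial_dissipation(w: List[float], p: List[float], **kwargs) -> List[float]:
    """Nodal dissipation D_i = d_{i+1/2} - d_{i-1/2}; zero at the two end points."""
    d = dissipative_fluxes(w, p, **kwargs)
    return [0.0] + [d[i] - d[i - 1] for i in range(1, len(w) - 1)] + [0.0]
```

In smooth regions the sensor is small, so only the fourth-order term acts and damps odd-even oscillations; at a pressure jump the second-order term takes over and the fourth-order term is reduced.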

At the solid wall boundaries the no-slip condition is used. The density and energy density in the grid points on a solid wall are calculated by solving the corresponding discrete conservation laws, using the two adjacent cells within the computational domain and their mirror images inside the wall as the control volume. The values of the density and energy density in the grid points inside the walls are adjusted such that the adiabatic wall condition is approximated. The boundary conditions at a (subsonic) far-field boundary are based on characteristic theory. The extent of the computational domain can be reduced without affecting the accuracy if a vortex is superimposed on the incoming free stream outside the computational domain [6].


Due to the topology of the two-element airfoil geometry, special points in the computational grid are unavoidable. The computational grids used contain two special points at block boundaries, where five cells meet (see figure 4). These points can be treated in an elegant way within the same numerical scheme if the dummy vertices outside the 'current' block are defined appropriately. The multi-valuedness of the variables at a special point caused by this asymmetric treatment is eliminated by taking the average of the five different values after all blocks have been treated. This is sketched in figure 3.



The system of ordinary differential equations that results from the spatial discretization is integrated in time using an explicit multistage Runge-Kutta method. Two schemes are implemented in the present flow solver: a three-stage scheme in which the dissipative fluxes (both viscous and artificial) are calculated once per time-step, and a five-stage scheme in which the dissipative terms are calculated only at the odd stages. This treatment both saves calculation time and enlarges the stability region of the method. Further calculation time is saved by advancing each grid point with the maximum local time-step allowed by its own stability limit. In this way the evolution from the initial solution to the steady state is no longer time-accurate, but the steady-state solution obtained is unaffected.
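The multistage time-stepping with partially frozen dissipation can be sketched as follows for the semi-discrete system $dw/dt = -(C(w) - D(w))$, with $C$ the convective and $D$ the dissipative (viscous plus artificial) contribution. The stage coefficients are the values commonly quoted for Jameson-type three- and five-stage schemes and are assumptions, since the paper does not list them; in the five-stage variant $D$ is evaluated at stages 1, 3 and 5 and simply reused in between (no weighted blending is assumed), and in the three-stage variant it is evaluated once per time-step. The helper `local_time_steps` and its CFL-times-volume-over-spectral-radius form are likewise illustrative assumptions.

```python
from typing import Callable, List, Sequence, Set, Tuple

Vector = List[float]
Operator = Callable[[Vector], Vector]

# Assumed stage coefficients and the stages at which the dissipation is (re)evaluated.
THREE_STAGE: Tuple[List[float], Set[int]] = ([0.6, 0.6, 1.0], {1})
FIVE_STAGE: Tuple[List[float], Set[int]] = ([0.25, 1.0 / 6.0, 0.375, 0.5, 1.0], {1, 3, 5})


def local_time_steps(cfl: float, volumes: Sequence[float],
                     spectral_radii: Sequence[float]) -> Vector:
    """Local time-stepping: each point advances at its own stability limit."""
    return [cfl * vol / rad for vol, rad in zip(volumes, spectral_radii)]


def multistage_step(w0: Vector, dt: Sequence[float], convective: Operator,
                    dissipative: Operator, alphas: Sequence[float],
                    dissipation_stages: Set[int]) -> Vector:
    """One time-step of w' = -(C(w) - D(w)) with D frozen between the selected stages."""
    if 1 not in dissipation_stages:
        raise ValueError("the dissipation must be evaluated at the first stage")
    w = list(w0)
    d: Vector = []
    for stage, alpha in enumerate(alphas, start=1):
        if stage in dissipation_stages:
            d = dissipative(w)      # re-evaluated only at the selected stages
        c = convective(w)           # convective part is evaluated at every stage
        w = [w0_i - alpha * dt_i * (c_i - d_i)
             for w0_i, dt_i, c_i, d_i in zip(w0, dt, c, d)]
    return w
```

For the scalar test equation $dw/dt = -w$ with $\Delta t = 0.1$, the five-stage variant returns about 0.9048155 after one step, close to $e^{-0.1} \approx 0.904837$. Whenever $C(w) = D(w)$ the state is returned unchanged for any choice of `dt`, which is why local time-stepping leaves the steady state unaffected.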

The above time-stepping method acts as the relaxation method and coarse grid operator in the 
multigrid solver (see reference [6]). In this solver an initial solution on the finest grid is obtained with 
a full multigrid method. This initial solution is corrected in the FAS-stage, where either V- or W- 
cycles can be chosen. A fixed number of pre- and post-relaxations is performed before turning to the 
next coarser or finer grid. The solution is transferred to a coarser grid by injection and the residuals by full weighting, and the corrections to the solution are prolonged by bilinear interpolation. To enhance the smoothing properties of the Runge-Kutta time-stepping technique, an implicit averaging of the residuals is applied, with the residuals frozen at the block boundaries. For mono-block applications this method has given satisfactory results for both two-dimensional and three-dimensional flows [5].
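The implicit residual averaging with frozen block-boundary residuals can be sketched along one grid line. The sketch assumes the common form $-\epsilon \bar R_{i-1} + (1 + 2\epsilon)\bar R_i - \epsilon \bar R_{i+1} = R_i$ at interior points, solved with the Thomas algorithm, while the two end values (the block-boundary points) keep their unsmoothed residuals. The smoothing coefficient `eps` is an assumed input; in two dimensions such an operator is typically applied line by line in each grid direction.

```python
from typing import List


def implicit_residual_averaging(r: List[float], eps: float) -> List[float]:
    """Smooth residuals along one grid line; the end residuals stay frozen."""
    n = len(r)
    if n <= 2:
        return list(r)
    m = n - 2                      # number of interior unknowns
    a, b, c = -eps, 1.0 + 2.0 * eps, -eps
    rhs = r[1:-1]                  # copy of the interior residuals
    rhs[0] -= a * r[0]             # frozen left boundary residual moved to the right-hand side
    rhs[-1] -= c * r[-1]           # frozen right boundary residual
    # Thomas algorithm: forward elimination.
    cp = [0.0] * m
    dp = [0.0] * m
    cp[0] = c / b
    dp[0] = rhs[0] / b
    for i in range(1, m):
        denom = b - a * cp[i - 1]
        cp[i] = c / denom
        dp[i] = (rhs[i] - a * dp[i - 1]) / denom
    # Back substitution.
    x = [0.0] * m
    x[-1] = dp[-1]
    for i in range(m - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return [r[0]] + x + [r[-1]]
```

A constant residual passes through unchanged, while the sawtooth residual $[0, 1, -1, 1, -1, 0]$ with $\epsilon = 1$ is damped to a peak of $3/11 \approx 0.27$, which illustrates why the averaging improves the smoothing of high-frequency error components.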

In the multi-element airfoil application care has to be taken in the definition of the residual-vector 
in the special points. The proposed treatment of a special point implies that the control volume is 
different in each of the five blocks where such a point is found. In the required averaging, the five residual vectors in a special point are weighted with their corresponding time-steps. Without this weighting the multigrid process cannot converge to the single-grid steady-state solution.
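One way to write the time-step weighting at a special point is the normalized mean $\bar R = \sum_b \Delta t_b R_b / \sum_b \Delta t_b$ over the blocks sharing the point. The normalization is an assumption, since the text only states that the residual vectors are weighted with their time-steps; the function name is hypothetical, and each residual vector is treated as a list of components (for example the four components of $U$).

```python
from typing import List, Sequence


def weighted_special_point_residual(residuals: Sequence[Sequence[float]],
                                    time_steps: Sequence[float]) -> List[float]:
    """Time-step-weighted mean of the residual vectors meeting at a special point."""
    if not residuals or len(residuals) != len(time_steps):
        raise ValueError("need one time-step per residual vector")
    total_dt = sum(time_steps)
    n_components = len(residuals[0])
    return [sum(dt * r[k] for r, dt in zip(residuals, time_steps)) / total_dt
            for k in range(n_components)]
```

With equal time-steps this reduces to the plain average used for the state variables; with unequal local time-steps, blocks whose control volume permits a larger step contribute more.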

In this multigrid, multiblock solver with a multistage time-stepping method there are various 
possibilities for intertwining the different loops. In the present study the grid loop is chosen as the 
outer loop and the effect of interchanging the block and the stage loop will be studied. Several 
’competing’ requirements serve as possible guidance for selecting a specific ordering of these loops. 
On the one hand, the anticipated parallel processing of the different blocks is more efficient if the data transfer between the blocks is kept to a minimum, i.e. with the stage loop inside the block loop. On the other hand, the good convergence of the multigrid mono-block solver may deteriorate, because the dummy variables near the block boundaries are then kept frozen during more stages of the time-step. This suggests putting the block loop inside the stage loop. To study this dilemma we implemented both loop orders in a flexible way: a single parameter determines whether the block loop is situated inside or outside the stage loop.
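The two loop orderings can be made concrete with a schematic time-step driver. In this sketch `update_stage` and `exchange_dummy_vertices` are hypothetical callbacks standing in for one Runge-Kutta stage on one block and for refreshing the dummy vertices from the neighboring blocks; a single flag selects the ordering, as in the solver described above, and the return value counts the data exchanges per time-step.

```python
from typing import Callable


def time_step(n_blocks: int, n_stages: int, block_inside_stage: bool,
              update_stage: Callable[[int, int], None],
              exchange_dummy_vertices: Callable[[], None]) -> int:
    """Advance all blocks by one time-step; return the number of data exchanges."""
    exchanges = 0
    if block_inside_stage:
        for stage in range(n_stages):
            for block in range(n_blocks):
                update_stage(block, stage)
            exchange_dummy_vertices()      # neighbors see new values after every stage
            exchanges += 1
    else:
        for block in range(n_blocks):
            for stage in range(n_stages):
                update_stage(block, stage)  # dummy vertices stay frozen for all stages
        exchange_dummy_vertices()
        exchanges += 1
    return exchanges
```

For the 37-block grid used below and a five-stage scheme, the first ordering exchanges data five times per time-step and the second only once, which is exactly the trade-off between convergence and parallel efficiency discussed above.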


RESULTS 


Description of the test-case 


We present results for a two-component airfoil geometry consisting of the NLR7301 wing section, from which a flap has been cut out, deflected at an angle of 20° and with a gap width of 2.6% chord length [7] (see figure 1). The combination of a Mach number of 0.185 and an angle of incidence of 6° or 13.1°, the latter being close to maximum lift conditions, yields subsonic flow. The Reynolds number based on the chord length of the airfoil is $2.51 \times 10^6$. In the viscous calculations the locations of the transition from laminar to turbulent flow are prescribed.

The C-type computational grids (for inviscid as well as viscous flow) were constructed by J.J. Benton of British Aerospace and are subdivided into 37 blocks (see figure 4). The grid lines are continuous across block boundaries. Two grids are used: an 'Euler' grid (inviscid) consisting of 16448 cells, and a 'Navier-Stokes' grid (viscous), which is refined in the boundary layers and wakes and consists of 28288 cells.



For both angles of incidence, wind-tunnel measurements by Van den Berg [7] are available, including velocity profiles in the boundary layers and the pressure coefficient on the profile.
Since the flow is attached apart from a small laminar separation bubble near the leading edge of 
the wing, the adopted turbulence model should be adequate and yield a useful comparison between 
experiment and calculation. 


Inviscid Flow 


In order to test the flow solver on the complicated block structure of the two-element airfoil geom- 
etry, we considered the relatively simple inviscid flow case, where in all blocks the Euler equations are 
solved. In this way problems related to the turbulence model are separated from possible algorithmic 
problems. The use of the Euler equations implies that the boundary conditions at the solid wall boundaries have to be changed. For inviscid flow there is only one physical boundary condition, that of zero mass flux through the wall. In the vertex-based approach the density, the pressure and the tangential velocity at the wall are approximated by linear extrapolation.

In figure 5 the multigrid convergence behavior of the solver in the 13.1° case is shown. The discrete $L_2$-norm of the residual of the density is plotted as a function of the number of W-cycles. A converged solution is obtained in a much smaller calculation time than with the single-grid approach, even though only three different grid levels are available. Both the single-grid and the multigrid calculations reached machine accuracy. Neither the specific block structure nor the treatment of the special points leads to any difficulties. For this inviscid test a comparison with experimental results is not meaningful and will not be made.



Figure 5: Convergence behavior for inviscid flow at an angle of incidence of 13.1°. 


Viscous Flow 


We consider the simulations of turbulent, viscous flow and present results for the 6° case only. 
Single-grid calculations in which only local time-stepping is applied as a convergence acceleration 
technique yield a steady-state solution which is in good agreement with the experimental results. 
However, in contrast with a fully inviscid simulation, the rate of convergence is very low, which renders this method unacceptable for practical applications. Therefore, to increase the convergence rate further, the multigrid technique and the implicit residual averaging described in section 2 are indispensable.

In a simulation of turbulent flow at high Reynolds number it is important that the effects of the physical dissipation are not outweighed by those of the numerical or artificial dissipation. This requirement could give rise to difficulties in the present multigrid method, since the time-stepping method requires a certain minimum amount of dissipation for sufficient smoothing of the large wave-number components of the error (see reference [5]). When the artificial dissipation in the boundary layer is reduced by scaling with the ratio of the local and free-stream Mach number, which decreases the smoothing properties of the time-stepping method, a converged solution (to engineering accuracy) could still be obtained by increasing the number of pre- and post-relaxations. The convergence behavior of this calculation during the FAS stage is shown in figure 6, where the discrete $L_2$-norm of the residual of the density is plotted as a function of the number of W-cycles. In the blocks outside the boundary layers and wakes the Euler equations are solved instead of the Navier-Stokes equations. The good agreement with the wind-tunnel measurements can be inferred from figure 7, where the experimental and numerically predicted pressure coefficients on the airfoil and flap are shown.

Figure 6: Viscous flow at an angle of incidence of 6.0°: convergence behavior.

This solution was obtained with the block loop inside the stage loop of the five-stage Runge-Kutta time-stepping method. Hence, the variables at the dummy vertices outside a block are updated after every stage, which keeps the effects of the multiblock structure on the convergence to a minimum. The frequency of data transfer between the blocks makes this method less efficient for parallel processing. However, with the block loop outside the stage loop, i.e. with an update of the dummy variables only after five flux evaluations, a converged solution could not be obtained. Apparently, the interval between two moments of data transfer between the blocks has to be sufficiently small in order to obtain a convergent multigrid method.

Figure 7: Viscous flow at an angle of incidence of 6.0°: comparison of the pressure coefficient on the airfoil between calculation (solid) and experiment (dashed).

Further evidence for this statement is obtained from calculations with a three-stage instead of a 
five-stage Runge-Kutta time-stepping method. If the block loop is outside the stage loop, the dummy 
variables are updated after three flux evaluations. Although the rate of convergence is lower than in the case with the loops interchanged (see figure 8), the solution has converged to engineering accuracy after about 200 W-cycles. A comparison of the three-stage and five-stage schemes with the block loop inside the stage loop shows that the five-stage scheme is more efficient: about 60 W-cycles suffice to reduce the residuals to the level the three-stage scheme reaches after 200 W-cycles. The five-stage scheme leads to a reduction in calculation time of approximately 60% in this instance.
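The figures quoted above also imply the relative cost of one W-cycle. Writing $t_3, t_5$ for the total calculation times and $c_3, c_5$ for the cost per W-cycle of the three- and five-stage schemes, a time reduction of about 60% with 60 instead of 200 W-cycles gives, as a rough consistency check rather than a measured value:

$$\frac{t_5}{t_3} = \frac{60\,c_5}{200\,c_3} \approx 0.4 \quad\Longrightarrow\quad \frac{c_5}{c_3} \approx \frac{0.4 \times 200}{60} \approx 1.33,$$

so each five-stage W-cycle would cost roughly one third more than a three-stage W-cycle, consistent with the five-stage scheme evaluating the dissipative terms at three stages instead of one.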


DISCUSSION 


We presented simulation results obtained with a multigrid multiblock method for a two-element airfoil. Both viscous and inviscid calculations were performed using the same multigrid process and the same vertex-based spatial discretization method. Moreover, either a three- or a five-stage Runge-Kutta scheme was considered for the integration in time, and the smoothing properties of this relaxation method were further enhanced through the introduction of local time-stepping and implicit residual averaging, in which the residuals at the block boundaries were kept fixed at their unsmoothed values.

Figure 8: Convergence behavior of the three-stage Runge-Kutta scheme for turbulent flow; comparison between block loop inside (solid) and outside (dashed) stage loop.


The inviscid calculations have shown that a solution which is converged up to machine accuracy can 
be obtained with this multigrid method. A comparison with the single grid simulation method shows 
that a considerable reduction in calculation time was obtained with the multigrid method, although 
the convergence of the single grid method for inviscid calculations was already quite acceptable. We 
also investigated two different numerical boundary conditions at the solid walls. It appeared that 
linear extrapolation of the pressure not only leads to a better convergence than constant extrapolation, 
but also gives rise to a much smaller entropy layer around the airfoil. The resulting drag coefficient, 
which theoretically should equal zero in this subsonic flow, is reduced by almost 60%. 


In the viscous calculations the single-grid method was found to yield a well-converged result in the 6° case; however, the convergence towards the steady-state solution was extremely slow, which makes a multigrid approach essential. A comparison of the calculation times required by both methods shows that a total speed-up of about a factor of 20 can be reached. The numerical predictions for the lift and pressure coefficients compare well with experimental results and give confidence in the use of the Baldwin-Lomax model for this application. The convergence of the multigrid process was studied in detail, showing that the ordering of the various loops in the process has a considerable effect. Interchanging the block and stage loops while keeping the grid loop as the outer loop yields optimal convergence when the block loop is put inside the stage loop. When the stage loop was put inside the block loop, the multigrid process did not converge with the five-stage Runge-Kutta method as the relaxation method. Apparently, the smoothing of the relaxation method becomes less effective as the number of stages between two 'updates' of the dummy variables increases. This result has some less favorable consequences for a possible parallel implementation of the multigrid method. On the one hand, parallel processing seems more efficient if the frequency of data transfer between the blocks can be reduced. On the other hand, reducing this frequency lowers the convergence rate of the multigrid process and in some instances even leads to a loss of convergence. This suggests that in a parallel implementation of this multigrid method an optimal rate of data exchange between the blocks should be determined.
For second-order elliptic boundary value problems, we develop a nonconforming multigrid method using the coarser-grid correction on the conforming finite element subspaces. A convergence proof for the V-cycle with an arbitrary number of smoothing steps is presented.


1. INTRODUCTION 


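Let $\Omega$ be a convex polygon in $\mathbb{R}^2$. Let $f \in L^2(\Omega)$, $\alpha \in C^1(\bar\Omega)$ and $\beta \in C^0(\bar\Omega)$. We assume there exists $\alpha_0$ such that $\alpha \ge \alpha_0 > 0$ and $\beta \ge 0$. In this paper we discuss convergence properties of the multigrid method for solving the Dirichlet problem

$$-\nabla \cdot (\alpha \nabla u) + \beta u = f \quad \text{in } \Omega, \qquad (1)$$

$$u = 0 \quad \text{on } \partial\Omega, \qquad (2)$$

using $P_1$ nonconforming finite elements (see [5, 6]).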

The prototype of the multigrid convergence theory is that 

For some number of smoothing steps the multigrid process is a contraction for some 
norm. Moreover, the contraction number is independent of the mesh size h. 

This was proved for conforming multigrid methods by Bank and Dupont [1]. Braess and Hackbusch [2] and Hackbusch [8] proved it for the V-cycle with one smoothing step. For the nonconforming multigrid method, it was proved by Braess and Verfürth [3] and Brenner [4] for the W-cycle under the condition that each iteration step contains sufficiently many smoothing steps.

The method presented in this paper consists of a smoothing step on the nonconforming finite element space of the finest grid and a correction step obtained by the conforming multigrid method on the conforming finite element subspaces of the coarser grids. The standard nonconforming multigrid method, whose convergence was proved by Brenner in [4], is based on smoothing and correction on the nonconforming finite element spaces. The important difference is that $V_{k-1} \not\subset V_k$ while $W_{k-1} \subset V_k$, where $V_k$ and $W_k$ are the nonconforming and conforming finite element spaces on mesh level $k$, respectively. Hence we can simply use the natural injection for the intergrid transfer of grid functions, and this intergrid transfer operator preserves the energy norm. Moreover, the error of the coarser-grid correction is orthogonal to $W_{k-1}$. Owing to these properties, the standard convergence proof in [2] for the V-cycle with one smoothing step of the conforming multigrid method carries over directly. In [3] Braess and Verfürth added a step-length parameter to the correction step of the standard nonconforming multigrid algorithm to improve the convergence. They proved convergence of the two-level case of this modified standard nonconforming multigrid method with one smoothing step. The rate of convergence of their algorithm should be better than, or at least equal to, that of the standard nonconforming multigrid method, but each iteration costs more. While Brenner proved the convergence of the standard nonconforming multigrid algorithm only for the W-cycle, in actual computations it also converges for the V-cycle with one smoothing step. The modified standard nonconforming multigrid algorithm likewise converges for the V-cycle with one smoothing step in actual computations. Our multigrid method is easier to implement and more effective, because it needs fewer computations and communications in a parallel sense. These computations were done on the CM-5 with vector units†.

*This research was partially supported by the National Science Foundation under Grant Nos. CDA-9024618 and DMS-9203502.

This paper is organized as follows. In Section 2 we discuss the fundamental estimates from the theory of finite elements and the intergrid transfer operator. The multigrid algorithm is discussed in Section 3. Section 4 contains the contraction properties of the $k$-level iteration. In the last section we compare the computational results of three algorithms.


2. THE FINITE ELEMENT SPACES 


The variational formulation of (1) and (2) is as follows: Find $u \in H_0^1(\Omega)$ such that

$$a(u,v) = F(v) \quad \forall v \in H_0^1(\Omega),$$

where

$$a(u,v) = \int_\Omega (\alpha \nabla u \cdot \nabla v + \beta u v) \quad \text{and} \quad F(v) = \int_\Omega f v .$$

Here, $H_0^1(\Omega)$ denotes the usual Sobolev space (see [5]).

Let $\{\mathcal{T}_k\}$, $k \ge 1$, be a family of triangulations of $\Omega$, where $\mathcal{T}_{k+1}$ is obtained by connecting the midpoints of the edges of the triangles in $\mathcal{T}_k$. Let $h_k := \max_{T \in \mathcal{T}_k} \operatorname{diam} T$; then $h_k = 2h_{k+1}$. Throughout this paper, $C$ denotes a positive constant independent of $k$, which may vary from occurrence to occurrence, even within the proof of the same theorem.

† These results are based upon a test version of the software where the emphasis was on providing functionality and the tools necessary to begin testing the CM-5 with vector units. This software release has not had the benefit of optimization or performance tuning and, consequently, is not necessarily representative of the performance of the full version of this software.
It is worth pointing out the motivation for the nonconforming finite elements. In the stationary Stokes problem for an incompressible viscous fluid, a major difficulty is known to lie in the numerical treatment of the incompressibility condition. Crouzeix and Raviart [6] advocated methods in which the incompressibility condition is only approximated, and found nonconforming finite elements very convenient for this purpose. By Uzawa's method the Stokes equations are reduced to a sequence of Dirichlet problems for the operator $-\Delta$. Thus we develop a nonconforming multigrid method for solving (1) and (2).

Now we define the nonconforming finite element space

$$V_k := \{ v : v|_T \text{ is linear for all } T \in \mathcal{T}_k,\ v \text{ is continuous at the midpoints of the edges, and } v = 0 \text{ at the midpoints on } \partial\Omega \}.$$

Note that functions in $V_k$ are, in general, not continuous.

We also use a conforming finite element space for our multigrid method NC-CMG. Define

$$W_k := \{ w : w|_T \text{ is linear for all } T \in \mathcal{T}_k,\ w \text{ is continuous on } \bar\Omega, \text{ and } w|_{\partial\Omega} = 0 \}.$$

The space $V_k$ will be used as the finest-grid space and $W_k$ as the coarser-grid spaces to obtain NC-CMG. Observe that $W_k = V_k \cap H_0^1(\Omega) = V_k \cap V_{k+1}$.

For each $k$, define on $V_k + H_0^1(\Omega)$

$$a_k(u,v) := \sum_{T \in \mathcal{T}_k} \int_T (\alpha \nabla u \cdot \nabla v + \beta u v)$$

and the energy norm induced by $a_k$,

$$\|v\|_k := \sqrt{a_k(v,v)} .$$

The bilinear form $a_k(\cdot,\cdot)$ is symmetric and positive definite on $V_k$. Moreover, we have the inverse estimate [4]

$$\|v\|_k \le C h_k^{-1} \|v\|_{L^2} \quad \forall v \in V_k. \qquad (3)$$

We also note that if $u, v \in H_0^1(\Omega)$, then $a_k(u,v) = a(u,v)$.

We now recall some fundamental estimates from the theory of finite elements.

Since $f \in L^2(\Omega)$, elliptic regularity implies that $u \in H^2(\Omega)$ (see [7]). For the same $f$, let $u_k \in V_k$ satisfy

$$a_k(u_k, v) = \int_\Omega f v \quad \forall v \in V_k,$$

and let $u_k^* \in W_k$ satisfy

$$a_k(u_k^*, v) = \int_\Omega f v \quad \forall v \in W_k .$$

Since $V_k$ satisfies the patch test (see [11]), we have the following estimate for the discretization error:

$$\|u - u_k\|_{L^2} + h_k \|u - u_k\|_k \le C h_k^2 \|u\|_{H^2} \qquad (4)$$

(see [6]). The estimate for the conforming discretization error is, of course, well known (see [5]):

$$\|u - u_k^*\|_{L^2} + h_k \|u - u_k^*\|_k \le C h_k^2 \|u\|_{H^2}. \qquad (5)$$


From spectral theory, there exist eigenvalues $0 < \lambda_1 \le \lambda_2 \le \cdots \le \lambda_{n_k}$ and eigenfunctions $\psi_1, \psi_2, \ldots, \psi_{n_k} \in V_k$ with $(\psi_i, \psi_j)_{L^2} = \delta_{ij}$ (the Kronecker delta), such that $a_k(\psi_i, v) = \lambda_i (\psi_i, v)_{L^2}$ for all $v \in V_k$. From the inverse estimate (3), there exists $C > 0$ such that

$$\lambda_i \le C h_k^{-2}, \quad 1 \le i \le n_k. \qquad (6)$$


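The same results hold for the conforming finite element spaces. The norm $|||v|||_{s,k}$ is defined (see [1]) as follows:

$$|||v|||_{s,k} := \Big( \sum_{i=1}^{n_k} \lambda_i^s v_i^2 \Big)^{1/2}, \quad \text{where } v = \sum_{i=1}^{n_k} v_i \psi_i \in V_k. \qquad (7)$$

Moreover,

$$|||v|||_{0,k} = \|v\|_{L^2} \quad \text{and} \quad |||v|||_{1,k} = \|v\|_k. \qquad (8)$$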


The Cauchy-Schwarz inequality implies

$$|a_k(v,w)| \le |||v|||_{1+t,k}\, |||w|||_{1-t,k}$$

for any $t \in \mathbb{R}$ and $v, w \in V_k$.
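The inequality follows from expanding both functions in the eigenbasis $\{\psi_i\}$, using $a_k(\psi_i, \psi_j) = \lambda_i (\psi_i, \psi_j)_{L^2} = \lambda_i \delta_{ij}$, and splitting each eigenvalue factor symmetrically before applying the Cauchy-Schwarz inequality in $\mathbb{R}^{n_k}$:

$$|a_k(v,w)| = \Big| \sum_{i=1}^{n_k} \big(\lambda_i^{(1+t)/2} v_i\big)\big(\lambda_i^{(1-t)/2} w_i\big) \Big| \le \Big(\sum_{i=1}^{n_k} \lambda_i^{1+t} v_i^2\Big)^{1/2} \Big(\sum_{i=1}^{n_k} \lambda_i^{1-t} w_i^2\Big)^{1/2} = |||v|||_{1+t,k}\, |||w|||_{1-t,k}.$$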

For $v \in V_{k-1}$, the intergrid transfer operator $I_{k-1}^k : V_{k-1} \to V_k$ is defined as follows. Let $p$ be a midpoint of a side of a triangle in $\mathcal{T}_k$. If $p$ lies in the interior of a triangle in $\mathcal{T}_{k-1}$, then we define

$$(I_{k-1}^k v)(p) := v(p).$$

Otherwise, if $p$ lies on the common edge of two adjacent triangles $T_1$ and $T_2$ in $\mathcal{T}_{k-1}$, then we define

$$(I_{k-1}^k v)(p) := \tfrac{1}{2}\left[ v|_{T_1}(p) + v|_{T_2}(p) \right].$$

From the definition of $I_{k-1}^k$, it is clear that

$$I_{k-1}^k w = w \quad \forall w \in W_{k-1} = V_k \cap V_{k-1} \subset V_k .$$

In other words, $I_{k-1}^k|_{W_{k-1}}$ is just the natural injection.
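The averaging rule that defines $I_{k-1}^k$ at a fine-grid midpoint can be sketched by representing a coarse function through its linear restriction $v|_T(x,y) = a + bx + cy$ on each coarse triangle. This representation and the function names are hypothetical illustrations (an actual implementation would work with midpoint degrees of freedom); the sketch also shows why the operator reduces to the natural injection on $W_{k-1}$: for a continuous function the two one-sided values on a common edge coincide.

```python
from typing import Sequence, Tuple

Linear = Tuple[float, float, float]  # coefficients (a, b, c) of a + b*x + c*y


def evaluate(lin: Linear, p: Tuple[float, float]) -> float:
    a, b, c = lin
    return a + b * p[0] + c * p[1]


def transfer_at_midpoint(containing: Sequence[Linear], p: Tuple[float, float]) -> float:
    """Value of I_{k-1}^k v at a fine-grid edge midpoint p.

    `containing` holds the linear pieces v|_T of the coarse triangles whose
    closure contains p: one piece if p is interior to a coarse triangle,
    two pieces if p lies on the common edge of two coarse triangles.
    """
    if len(containing) == 1:
        return evaluate(containing[0], p)
    if len(containing) == 2:
        # Average of the two one-sided values across the common coarse edge.
        return 0.5 * (evaluate(containing[0], p) + evaluate(containing[1], p))
    raise ValueError("p must lie in one coarse triangle or on one interior coarse edge")
```

For example, the pieces $x + y$ and $2x$ agree on the common edge $x = y$, so at $p = (0.25, 0.25)$ the transfer returns their common value $0.5$, while the discontinuous pieces $x$ and $y$ at $p = (0.25, 0.5)$ give the average $0.375$.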

Now we are ready to state an approximation property. 

Lemma 1. Given $u \in V_k$, let $u^* \in W_{k-1}$ be the solution of

$$a_k(u - u^*, v) = 0 \quad \forall v \in W_{k-1}.$$

Then 
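Proof. Let $g \in V_k$ satisfy

$$(g, v) = a_k(u, v) \quad \forall v \in V_k.$$

Then

$$a_k(u^*, v) = a_k(u, v) = (g, v) \quad \forall v \in W_{k-1}.$$

Now let $w \in H_0^1(\Omega)$ be the solution of the Dirichlet problem

$$-\nabla \cdot (\alpha \nabla w) + \beta w = g \quad \text{in } \Omega, \qquad w = 0 \quad \text{on } \partial\Omega .$$

Then, by elliptic regularity, $\|w\|_{H^2} \le C \|g\|_{L^2}$. Since $u$ and $u^*$ are the nonconforming and conforming approximations of $w$ in $V_k$ and $W_{k-1}$, respectively (with $h_{k-1} = 2h_k$), it follows from the discretization error estimates (4) and (5) that

$$\|u - u^*\|_{L^2} \le \|u - w\|_{L^2} + \|w - u^*\|_{L^2} \qquad (9)$$

$$\le C h_k^2 \|w\|_{H^2} \qquad (10)$$

$$\le C h_k^2 \|g\|_{L^2}. \qquad (11)$$

Moreover,

$$\|g\|_{L^2}^2 = (g, g) = a_k(u, g) \le |||u|||_{2,k}\, |||g|||_{0,k} .$$

But $|||g|||_{0,k} = \|g\|_{L^2}$ by (8). Therefore,

$$\|g\|_{L^2} \le |||u|||_{2,k} .$$

Combining the inverse estimate (3) and (11), we obtain

$$\|u - u^*\|_k \le C h_k^{-1} \|u - u^*\|_{L^2} \le C h_k\, |||u|||_{2,k} . \qquad \square$$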


3. THE MULTIGRID ALGORITHM


Now we consider a decreasing sequence of mesh sizes $h_k$:

$$h_0 > h_1 > \cdots > h_k > \cdots > h_{k_{\max}}.$$

We first describe the $k$-level iteration scheme of the conforming multigrid algorithm. The $k$-level
iteration with initial guess $z_0$ yields CMG$(k, z_0, G)$ as a conforming approximate solution to the
following problem.

Find $z \in W_k$ such that $a_k(z, v) = G(v)$ for all $v \in W_k$, where $G \in W_k'$.

Here, $W_k'$ is the dual space of $W_k$. For $k = 1$, CMG$(1, z_0, G)$ is the solution obtained from a direct
method. For $k > 1$, CMG$(k, z_0, G) = z_m + I^k_{k-1} q$, where the approximation $z_m \in W_k$ is constructed
recursively from the initial guess $z_0$ and the equations

$$z_i = z_{i-1} + \frac{1}{\Lambda_k}(G - A_k z_{i-1}), \qquad 1 \le i \le m.$$

Here, $\Lambda_k$ is greater than or equal to the largest eigenvalue of $A_k$, the stiffness matrix of $a_k$
in the conforming finite element space $W_k$, and $m$ is an integer to be determined later. The
coarser-grid correction $q \in W_{k-1}$ is obtained by applying the $(k-1)$-level iteration once; in other
words, this is the V-cycle multigrid method. More precisely,

$$q = \text{CMG}(k-1, 0, \bar{G}),$$

where $\bar{G} \in W_{k-1}'$ is defined by $\bar{G}(v) := G(I^k_{k-1} v) - a_k(z_m, I^k_{k-1} v)$ for all $v \in W_{k-1}$.
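The smoothing step above is a Richardson iteration whose parameter is an upper bound $\Lambda_k$ for the largest eigenvalue; Section 5 obtains such bounds from the Gershgorin theorem. The following minimal Python sketch illustrates both ingredients on an assumed model matrix, the one-dimensional three-point matrix tridiag(-1, 2, -1), rather than on the paper's finite element stiffness matrices; the function names are hypothetical. Because each eigencomponent of the error is multiplied by $1 - \lambda_i/\Lambda$ per step, three steps should barely change the smoothest eigenvector while nearly annihilating the most oscillatory one.

```python
import numpy as np

def laplacian_1d(n):
    """Model matrix tridiag(-1, 2, -1) of order n (the mesh-size factor is omitted)."""
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def gershgorin_bound(A):
    """Gershgorin upper bound max_i (a_ii + sum_{j != i} |a_ij|) on the largest eigenvalue."""
    d = np.diag(A)
    off = np.abs(A).sum(axis=1) - np.abs(d)
    return float(np.max(d + off))

def richardson(A, rhs, z0, m, lam_bar):
    """m smoothing steps z_i = z_{i-1} + (rhs - A z_{i-1}) / lam_bar."""
    z = z0.copy()
    for _ in range(m):
        z = z + (rhs - A @ z) / lam_bar
    return z

n = 15
A = laplacian_1d(n)
lam_bar = gershgorin_bound(A)                 # 4.0 for this matrix
idx = np.arange(1, n + 1)
low = np.sin(np.pi * idx / (n + 1))           # smoothest eigenvector
high = np.sin(n * np.pi * idx / (n + 1))      # most oscillatory eigenvector
rhs = np.zeros(n)                             # exact solution is 0, so z is the error
for name, e0 in (("low", low), ("high", high)):
    e3 = richardson(A, rhs, e0, 3, lam_bar)
    print(name, np.linalg.norm(e3) / np.linalg.norm(e0))
```

The Gershgorin bound is cheap but not sharp, which is the trade-off mentioned in Section 5: a larger $\Lambda$ weakens the damping of every component.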

The nonconforming multigrid algorithm of this paper is as follows. The $k_{\max}$-level iteration with
initial guess $z_0$ yields NC-CMG$(k_{\max}, z_0, F)$ as a nonconforming approximate solution to the
following problem.

Find $z \in V_{k_{\max}}$ such that

$$a_{k_{\max}}(z, v) = F(v) := \int_\Omega f v \quad \forall v \in V_{k_{\max}}. \qquad (12)$$

For $k_{\max} = 1$, NC-CMG$(1, z_0, F)$ is the solution obtained from a direct method.

For $k_{\max} > 1$:

Smoothing Step: the approximation $z_m \in V_{k_{\max}}$ is constructed recursively from the initial guess $z_0$
and the equations

$$z_i = z_{i-1} + \frac{1}{\Lambda_{k_{\max}}}(F - A_{k_{\max}} z_{i-1}), \qquad 1 \le i \le m. \qquad (13)$$

Here, $\Lambda_{k_{\max}}$ is greater than or equal to the largest eigenvalue of $A_{k_{\max}}$, the stiffness
matrix of $a_{k_{\max}}$ in the nonconforming finite element space $V_{k_{\max}}$.

Correction Step: the coarser-grid correction $q \in W_{k_{\max}-1}$ is obtained by applying the $(k_{\max}-1)$-level
conforming iteration once. More precisely,

$$q = \text{CMG}(k_{\max} - 1, 0, \bar{F}),$$

where $\bar{F} \in W_{k_{\max}-1}'$ is defined by $\bar{F}(v) := F(I^{k_{\max}}_{k_{\max}-1} v) - a_{k_{\max}}(z_m, I^{k_{\max}}_{k_{\max}-1} v)$ for all $v \in W_{k_{\max}-1}$.

Put

$$\text{NC-CMG}(k_{\max}, z_0, F) = z_m + I^{k_{\max}}_{k_{\max}-1} q.$$
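To make the recursive structure of CMG$(k, z_0, G)$ concrete, here is a hedged sketch of the same V-cycle (m Richardson smoothing steps, one recursive coarse-grid correction started from zero, a direct solve on level 1) for an assumed one-dimensional conforming P1 model problem $-u'' = 1$ on (0, 1). It is not the authors' code and does not implement the nonconforming element of this paper: linear interpolation plays the role of $I^k_{k-1}$, coarse matrices are formed by the Galerkin product $P^T A P$, and the names `cmg`, `stiffness` and `prolongation` are hypothetical.

```python
import numpy as np

def stiffness(n):
    """P1 stiffness matrix of -u'' on (0, 1) with n interior nodes, h = 1/(n+1)."""
    h = 1.0 / (n + 1)
    return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h

def prolongation(nc):
    """Linear interpolation from nc coarse interior nodes to 2*nc + 1 fine nodes."""
    P = np.zeros((2 * nc + 1, nc))
    for j in range(nc):
        P[2 * j, j] += 0.5
        P[2 * j + 1, j] = 1.0
        P[2 * j + 2, j] += 0.5
    return P

def cmg(k, z0, G, mats, prolongs, m=2):
    """Hypothetical V-cycle: m Richardson steps, then one coarse cycle from 0."""
    A = mats[k]
    if k == 1:
        return np.linalg.solve(A, G)             # direct method on level 1
    lam_bar = np.max(np.abs(A).sum(axis=1))      # Gershgorin bound >= lambda_max
    z = z0.copy()
    for _ in range(m):                           # smoothing step
        z = z + (G - A @ z) / lam_bar
    P = prolongs[k]                              # plays the role of I^k_{k-1}
    G_bar = P.T @ (G - A @ z)                    # G_bar(v) = G(Pv) - a(z_m, Pv)
    q = cmg(k - 1, np.zeros(P.shape[1]), G_bar, mats, prolongs, m)
    return z + P @ q                             # coarser-grid correction

K = 6
sizes = {k: 2 ** k - 1 for k in range(1, K + 1)}
mats = {K: stiffness(sizes[K])}
prolongs = {}
for k in range(K, 1, -1):
    prolongs[k] = prolongation(sizes[k - 1])
    mats[k - 1] = prolongs[k].T @ mats[k] @ prolongs[k]   # Galerkin coarse matrix

n = sizes[K]
G = np.full(n, 1.0 / (n + 1))                    # load vector for f = 1
exact = np.linalg.solve(mats[K], G)
z = np.zeros(n)
errors = []
for cycle in range(8):
    z = cmg(K, z, G, mats, prolongs)
    e = z - exact
    errors.append(float(np.sqrt(e @ mats[K] @ e)))  # energy norm of the error
print(errors)
```

Since both the smoothing step and the Galerkin coarse-grid correction are nonexpansive in the energy norm, the printed errors are expected to decrease monotonically.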
4. ESTIMATE OF CONVERGENCE RATE


Now we can proceed with the well-known analysis of the conforming multigrid method in [2].
Define the linear mapping $J : V_k \to V_k$ by

$$J w = \sum_i \Big(1 - \frac{\lambda_i}{\lambda_{\max}}\Big) \nu_i \psi_i \quad \text{for } w = \sum_i \nu_i \psi_i.$$

Here the $\lambda_i$ are the eigenvalues of $a_k$, and $\lambda_{\max}$ denotes the upper bound for the largest eigenvalue used in the smoothing step (13). The smoothing step (13) amplifies the error $e_i = z - z_i$ by $J$, i.e.,
$e_i = J e_{i-1}$. Note that $J$ is a self-adjoint and positive semidefinite operator with respect to the energy norm.

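Define the weaker seminorm

$$|w|^2 := \sum_i \lambda_i \Big(1 - \frac{\lambda_i}{\lambda_{\max}}\Big) \nu_i^2 \quad \text{for } w = \sum_i \nu_i \psi_i.$$

From (7) and (8) we know $\|w\|_k^2 = \sum_i \lambda_i \nu_i^2$ and $|w| \le \|w\|_k$. Define the ratio

$$\rho(w) := \begin{cases} |w| / \|w\|_k & \text{if } w \ne 0, \\ 0 & \text{if } w = 0. \end{cases}$$

It can be regarded as a measure of the smoothness of $w \in V_k$, because for a smooth function the
coefficients for small $\lambda_i$ dominate and $|w| \approx \|w\|_k$.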


Lemma 2 Given $w \in V_k$, put $\rho = \rho(J^m w)$. Then

$$\|J^m w\|_k \le \rho^m \|w\|_k.$$

Proof. Similar to the proof of Lemma 4.3 in [2]. $\square$
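Lemma 2 is stated only with a reference to [2]. As a numerical sanity check (not a proof), the sketch below works directly in the eigenbasis $\{\psi_i\}$: with hypothetical random eigenvalues $\lambda_i$ and coefficients $\nu_i$, the operator $J^m$ multiplies $\nu_i$ by $(1 - \lambda_i/\lambda_{\max})^m$, and the script checks $\|J^m w\|_k \le \rho(J^m w)^m \|w\|_k$ on random samples. The eigenvalue range, the sample counts and all names are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = np.sort(rng.uniform(0.1, 10.0, size=50))  # hypothetical eigenvalues lambda_i
lam_max = lam[-1]                                # bound used in the smoother
mu = 1.0 - lam / lam_max                         # J multiplies coefficient i by mu_i

def energy_sq(nu):
    """||w||_k^2 = sum_i lambda_i nu_i^2."""
    return float(np.sum(lam * nu**2))

def weak_sq(nu):
    """|w|^2 = sum_i lambda_i (1 - lambda_i / lambda_max) nu_i^2."""
    return float(np.sum(lam * mu * nu**2))

def rho(nu):
    """Smoothness ratio rho(w) = |w| / ||w||_k (0 for w = 0)."""
    e = energy_sq(nu)
    return 0.0 if e == 0.0 else np.sqrt(weak_sq(nu) / e)

def lemma2_holds(nu, m):
    """Check ||J^m w||_k <= rho(J^m w)^m ||w||_k for the coefficient vector nu."""
    jm_nu = mu**m * nu
    lhs = np.sqrt(energy_sq(jm_nu))
    rhs = rho(jm_nu) ** m * np.sqrt(energy_sq(nu))
    return lhs <= rhs * (1.0 + 1e-12)            # small relative tolerance for rounding

checks = [lemma2_holds(rng.standard_normal(lam.size), m)
          for m in range(1, 6) for _ in range(200)]
print(all(checks))
```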

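Let $\bar{q} \in W_{k-1}$ be the exact coarser-grid correction, i.e.,

$$a_{k-1}(\bar{q}, v) = F(v) - a_k(z_m, v) \quad \forall v \in W_{k-1}.$$

Define

$$Q e_m := e_m - \bar{q}.$$

Then $Q$ is the $a_k$-orthogonal projector from $V_k$ onto $W_{k-1}^{\perp}$, the $a_k$-orthogonal complement of $W_{k-1}$. Note that $\bar{q}$ is the $a_k$-orthogonal projection
of $e_m$ onto $W_{k-1}$.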


Lemma 3 Given $w \in V_k$, we have

$$|Qw|_{1,k} \le \min\Big\{1, C\sqrt{1 - \rho(w)}\Big\}\, |w|_{1,k}.$$
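Proof. For $w = \sum_i \nu_i \psi_i$ we have

$$|w|_{2,k}^2 = \sum_i \lambda_i^2 \nu_i^2 = \lambda_{\max}\Big(\sum_i \lambda_i \nu_i^2 - \sum_i \lambda_i\Big(1 - \frac{\lambda_i}{\lambda_{\max}}\Big)\nu_i^2\Big) = \lambda_{\max}\big(|w|_{1,k}^2 - |w|^2\big).$$

It follows from Lemma 1 that $|Qw|_{1,k} \le C h_k |w|_{2,k}$. This and the estimate (6) for $\lambda_{\max}$ imply

$$|Qw|_{1,k}^2 \le C' h_k^2 \lambda_{\max}\big(|w|_{1,k}^2 - |w|^2\big) \le C\big(|w|_{1,k}^2 - |w|^2\big) = C\big(1 - \rho(w)^2\big)|w|_{1,k}^2.$$

Moreover, since $Q$ is an orthogonal projector, $|Qw|_{1,k} \le |w|_{1,k}$. Because $1 - \rho^2 \le 2(1 - \rho)$ for $0 \le \rho \le 1$, we obtain

$$|Qw|_{1,k} \le \min\Big\{1, C\sqrt{1 - \rho(w)}\Big\}\, |w|_{1,k}. \qquad \square$$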


We are now (as in [10]) in a position to define three multigrid iterative schemes for the solution
of (12):

1. the symmetric scheme NC-CMG$^{\vee}_k$: symmetric-smoothing NC-CMG scheme;

2. the coarse-to-fine cycle NC-CMG$^{/}_k$: post-smoothing NC-CMG scheme;

3. the fine-to-coarse cycle NC-CMG$^{\backslash}_k$: our NC-CMG scheme.


In particular, we have [10]

$$\|\text{NC-CMG}^{/}_k\|_k = \|\text{NC-CMG}^{\backslash}_k\|_k, \qquad \|\text{NC-CMG}^{\vee}_k\|_k = \|\text{NC-CMG}^{\backslash}_k\|_k^2.$$

The symmetric method NC-CMG$^{\vee}$ enables us to use estimates with respect to the energy norm
and to apply a duality argument.


Lemma 4 The multigrid algorithm NC-CMG$^{\vee}_k$ has the convergence factor

$$\|\text{NC-CMG}^{\vee}_k\|_k \le \max_{0 \le \rho \le 1} \rho^{2m}\big\{\epsilon + (1 - \epsilon)\min(1, C[1 - \rho])\big\} \qquad (14)$$

with respect to the energy norm, where $\epsilon$ is the convergence factor of the $(k-1)$-level CMG$^{\vee}_{k-1}$ and the constant $C$ is
independent of $k$ and $m$.

We note that the right-hand side of (14) is a monotone (nondecreasing) function of $\epsilon$ because of the cut-off
induced by the min-operation contained in the expression.

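Proof. The coarse-grid correction $q$ is computed only approximately, so that

$$z_{m+1} = z_m + q = z_m + \bar{q} + \epsilon w' \qquad (\text{i.e. } \|q - \bar{q}\|_k \le \epsilon \|\bar{q}\|_k)$$

with some $w' \in W_{k-1}$ satisfying $\|w'\|_k \le \|\bar{q}\|_k$. Hence the error is

$$e_{m+1} = e_m - \bar{q} - \epsilon w' = Q e_m - \epsilon w'.$$

Since $Q e_m$ is orthogonal to $W_{k-1}$ and $w' \in W_{k-1}$, we get

$$\|Q e_m - w'\|_k^2 = \|Q e_m\|_k^2 + \|w'\|_k^2 \le \|Q e_m\|_k^2 + \|\bar{q}\|_k^2 \qquad (15)$$

$$= \|Q e_m\|_k^2 + \|(I - Q) e_m\|_k^2 = \|e_m\|_k^2. \qquad (16)$$

In order to estimate the final error $e_{2m+1} = J^m e_{m+1}$, we use a duality argument:
$\|e_{2m+1}\|_k = \sup_{w} a_k(w, e_{2m+1}) / \|w\|_k$. Note that (16), $Q^2 = Q$ and the Cauchy-Schwarz inequality
imply

$$\begin{aligned}
a_k(w, e_{2m+1}) &= a_k\big(w, J^m(Q e_m - \epsilon w')\big) \\
&= a_k\big(J^m w, (1 - \epsilon) Q^2 e_m + \epsilon (Q e_m - w')\big) \\
&\le (1 - \epsilon)\, a_k(J^m w, Q^2 e_m) + \epsilon \|J^m w\|_k \|e_m\|_k \\
&\le (1 - \epsilon) \|Q J^m w\|_k \|Q J^m e_0\|_k + \epsilon \|J^m w\|_k \|J^m e_0\|_k \\
&\le \big[(1 - \epsilon)\|Q J^m w\|_k^2 + \epsilon \|J^m w\|_k^2\big]^{1/2} \big[(1 - \epsilon)\|Q J^m e_0\|_k^2 + \epsilon \|J^m e_0\|_k^2\big]^{1/2}.
\end{aligned}$$

Given $w \in V_k$, by Lemmas 2 and 3 it follows that

$$(1 - \epsilon)\|Q J^m w\|_k^2 + \epsilon \|J^m w\|_k^2 \le \rho^{2m}\big\{\epsilon + (1 - \epsilon)\min(1, C[1 - \rho])\big\}\|w\|_k^2,$$

where $\rho = \rho(J^m w)$. Hence

$$\|e_{2m+1}\|_k \le \max_{0 \le \rho \le 1} \rho^{2m}\big\{\epsilon + (1 - \epsilon)\min(1, C[1 - \rho])\big\}\|e_0\|_k. \qquad \square$$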


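Theorem 5 If $\|\text{CMG}^{\backslash}_{k-1}\|_{k-1} \le \delta^{1/2}$, where $C/(C + 2m) \le \delta < 1$, then

$$\|\text{NC-CMG}^{\backslash}_k\|_k \le \delta^{1/2}.$$


Proof. We conclude from Lemma 4 and the monotonicity of (14) in $\epsilon$ that

$$\|\text{NC-CMG}^{\vee}_k\|_k \le \max_{0 \le \rho \le 1} \rho^{2m}\big\{\delta + (1 - \delta)\min(1, C[1 - \rho])\big\},$$

because $\|\text{CMG}^{\vee}_{k-1}\|_{k-1} = \|\text{CMG}^{\backslash}_{k-1}\|_{k-1}^2 \le \delta$. The maximum $\delta$ is attained at $\rho = 1$ when $\delta \ge C/(C + 2m)$. Hence

$$\|\text{NC-CMG}^{\backslash}_k\|_k = \|\text{NC-CMG}^{\vee}_k\|_k^{1/2} \le \delta^{1/2}. \qquad \square$$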
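The proof of Theorem 5 uses the fact that, with $\epsilon = \delta$, the maximum over $\rho$ in (14) equals $\delta$ (attained at $\rho = 1$) once $\delta \ge C/(C + 2m)$. The short Python sketch below evaluates that maximum on a fine grid of $\rho$ values for the illustrative (assumed) values $C = 2$ and $m = 2$, for which the threshold is 1/3; the function name `lemma4_bound` is hypothetical.

```python
import numpy as np

def lemma4_bound(delta, C, m, num=200001):
    """Grid maximum over rho in [0, 1] of rho^(2m) {delta + (1 - delta) min(1, C(1 - rho))}."""
    rho = np.linspace(0.0, 1.0, num)          # includes rho = 1 exactly
    vals = rho ** (2 * m) * (delta + (1.0 - delta) * np.minimum(1.0, C * (1.0 - rho)))
    return float(vals.max())

C, m = 2.0, 2
threshold = C / (C + 2 * m)                   # = 1/3 for these values
for delta in (0.1, threshold, 0.5, 0.9):
    print(delta, lemma4_bound(delta, C, m))
```

Below the threshold (for example $\delta = 0.1$) the maximum exceeds $\delta$, so the induction in Theorem 5 would not close; at or above it the bound reproduces $\delta$.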
Table I: Number of grid intervals = 8, i.e. h = 1/8

             S-NCMG            M-NCMG            NC-CMG
smoothing    iter  time (sec)  iter  time (sec)  iter  time (sec)
1            4     .909        3     .788        3     .233
2            3     .689        2     .523        2     .156
3            2     .471        2     .540        2     .170
4            2     .483        2     .549        2     .177


Since the conforming multigrid method with the V-cycle and an arbitrary number of smoothing steps is convergent,
we can choose $\delta$ such that $1 > \delta \ge C/(C + 2m)$ and $\|\text{CMG}^{\backslash}_{k-1}\|_{k-1} \le \delta^{1/2}$.

5. EXPERIMENTAL RESULTS


We implemented the standard nonconforming multigrid algorithm S-NCMG of [4], the modified
standard nonconforming multigrid algorithm M-NCMG of [3], and NC-CMG with the V-cycle for
Laplace's equation

$$-\Delta u = -1 \ \text{ in } \Omega = \text{unit square}, \qquad u = 0 \ \text{ on } \partial\Omega.$$

Let $\{\phi^k_1, \ldots, \phi^k_{n_k}\}$ be the basis of $V_k$ such that each $\phi^k_j$ equals 1 at exactly one midpoint and equals
0 at all other midpoints. The stiffness matrix representing $a_k(\cdot, \cdot)$ with respect to this basis of the
nonconforming space has at most five entries per row. In the conforming case, the stiffness matrix
again has at most five entries per row. Therefore $z_m$ can be obtained from $z_0$ by iterating with a sparse
banded matrix. We use the Gershgorin theorem to obtain bounds on the maximum
eigenvalues. These bounds are rough, so the convergence rate is not optimal, but there is a
trade-off, because finding the exact maximum eigenvalue costs more. Note that the matrix for $I^k_{k-1}$
also has at most five entries per row.

We take the initial guess $z_0 = 0$. The programs execute the multigrid iterations until the discrete
energy norm of the true error is below the tolerance 1/(number of basis functions), for various mesh sizes and
numbers of smoothing steps. The reference solution is computed by the SSOR-preconditioned conjugate
gradient method for the five-point finite difference scheme, stopped when the difference of two consecutive
iterates is less than the tolerance $10^{-9}$ in the discrete sense. The experiments reported here
were run in double-precision arithmetic on the CM-5 vector units with 32K processors.

There are many ways to measure the performance of a parallel algorithm running on a parallel
processor (see [9]). The most important and commonly used metric is the elapsed CPU time to run a
job on a given machine, even though it depends on how well the program is optimized. We used the
power method to obtain the rate of convergence. In Tables V-VIII the rate of convergence of
S-NCMG and M-NCMG is slightly smaller or larger than the rate of convergence of NC-CMG.
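A minimal sketch of the power-method idea, assuming a linear iteration whose error propagation $e \mapsto Me$ can be applied by running the iteration with a zero right-hand side: repeatedly apply it to a normalized random error and record the norm growth. For a self-contained check, the sketch uses the Richardson smoother on the assumed one-dimensional model matrix tridiag(-1, 2, -1), whose rate $\cos^2(\pi/(2(n+1)))$ is known in closed form; the function names are hypothetical, and the same loop could be wrapped around a full multigrid cycle.

```python
import numpy as np

def power_method_rate(apply_iteration, n, steps=500, seed=0):
    """Estimate the asymptotic rate of a linear iteration e -> M e by the power method."""
    rng = np.random.default_rng(seed)
    e = rng.standard_normal(n)
    e /= np.linalg.norm(e)
    rate = 0.0
    for _ in range(steps):
        e_new = apply_iteration(e)
        rate = np.linalg.norm(e_new)      # ||M e|| with ||e|| = 1
        e = e_new / rate
    return float(rate)

n = 15
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
lam_bar = 4.0                             # Gershgorin bound for this matrix

def richardson_error(e):
    """Error propagation of one Richardson step (zero right-hand side)."""
    return e - (A @ e) / lam_bar

est = power_method_rate(richardson_error, n)
exact = np.cos(np.pi / (2 * (n + 1))) ** 2
print(est, exact)
```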
Table II: Number of grid intervals = 16, i.e. h = 1/16

             S-NCMG            M-NCMG            NC-CMG
smoothing    iter  time (sec)  iter  time (sec)  iter  time (sec)
1            7     2.604       5     2.089       5     .766
2            4     1.526       3     1.187       3     .481
3            3     1.183       3     1.247       3     .512
4            3     1.212       3     1.240       2     .360


Table III: Number of grid intervals = 32, i.e. h = 1/32

             S-NCMG            M-NCMG            NC-CMG
smoothing    iter  time (sec)  iter  time (sec)  iter  time (sec)
1            10    6.037       7     4.294       7     1.625
2            6     3.723       5     3.163       4     .970
3            5     3.196       4     2.573       4     1.034
4            4     2.641       3     1.975       3     .832


Table IV: Number of grid intervals = 64, i.e. h = 1/64

             S-NCMG            M-NCMG            NC-CMG
smoothing    iter  time (sec)  iter  time (sec)  iter  time (sec)
1            14    16.668      10    11.879      9     2.874
2            8     9.560       7     8.396       5     1.692
3            6     7.196       5     6.059       4     1.447
4            5     6.200       4     4.987       4     1.544


Table V: Number of grid intervals = 8, i.e. h = 1/8

             rate of convergence
smoothing    S-NCMG  M-NCMG  NC-CMG
1            .903    .903    .906
2            .815    .815    .820
3            .736    .736    .742
4            .665    .665    .672





Table VI: Number of grid intervals = 16, i.e. h = 1/16

             rate of convergence
smoothing    S-NCMG  M-NCMG  NC-CMG
1            .904    .904    .910
2            .817    .818    .829
3            .739    .739    .754
4            .668    .669    .687


Table VII: Number of grid intervals = 32, i.e. h = 1/32

             rate of convergence
smoothing    S-NCMG  M-NCMG  NC-CMG
1            .904    .904    .911
2            .818    .818    .830
3            .740    .740    .757
4            .669    .669    .689


Table VIII: Number of grid intervals = 64, i.e. h = 1/64

             rate of convergence
smoothing    S-NCMG  M-NCMG  NC-CMG
1            .904    .904    .911
2            .939    .818    .830
3            .888    .740    .757
4            .773    .669    .690





Figure 1: Nonconforming (A) vs. conforming (B).


In Figure 1, (A) and (B) show the locations of the nodal basis functions of the nonconforming finite
elements and the conforming finite elements, respectively. Squares represent the basis of $V_{k-1}$ or $W_{k-1}$,
and circles represent the basis of $V_k$ or $W_k$. In the correction step the centered black square
communicates with the black circles around it, so S-NCMG and M-NCMG need additional
communication. Since performance is determined mainly by communication time on a
massively parallel machine like the CM-5, S-NCMG and M-NCMG require more CPU time than
NC-CMG, as shown in Tables I-IV. Moreover, NC-CMG does less computation and is easier to
implement, because the number of basis functions of $V_k$ is approximately three times that of $W_k$, and
$W_{k-1} \subset W_k$.
SUMMARY


Several iterative algorithms based on multigrid methods are introduced for solving linear 
Fredholm integral equations of the second kind. Automatic programs based on these algorithms 
are introduced using Simpson’s rule and the piecewise Gaussian rule for the numerical 
integration. 


INTRODUCTION 


Several multigrid iterative methods based on the Nystrom method are applied to the fast
solution of the large dense systems of equations that arise from the discretization of Fredholm
integral equations of the second kind. We consider the linear Fredholm integral equation of
the second kind

$$\lambda x(s) - \int_D k(s,t)\, x(t)\, dt = y(s), \qquad s \in D, \qquad (1)$$

with $D$ a bounded closed domain and $y \in X$, where $X$ is the underlying Banach space. The necessary
assumptions are

(i) $k(s,t)$ is such that the associated integral operator $K$ is compact from $X$ into $X$;

(ii) $\lambda$ is not an eigenvalue of $K$, and $\lambda \ne 0$.

The Nystrom method for solving (1) uses some type of numerical integration to obtain the
approximating equation

$$\lambda x_l(s) - \sum_{j=1}^{n_l} \alpha_j(s)\, x_l(t_j) = y(s), \qquad s \in D, \qquad (2)$$

where the nodes $t_1, t_2, \ldots, t_{n_l}$ are in $D$ and $x_l(t) \approx x(t)$. The weights $\alpha_j(s)$ can be defined in a variety of
ways, depending on the smoothness and form of the kernel function. If $k(s,t)$ and $x(t)$ are
reasonably smooth, usually $\alpha_j(s) = w_j k(s, t_j)$, where

$$\int_D f(t)\, dt \approx \sum_{j=1}^{n_l} w_j f(t_j) \qquad (3)$$

is a numerical integration formula. Let the numerical integration operator $K_l$ be defined by

$$K_l x(s) = \sum_{j=1}^{n_l} w_j k(s, t_j)\, x(t_j), \qquad s \in D. \qquad (4)$$

Using (2) and (4), (1) is approximated by the linear system

$$\lambda x_l(t_i) - \sum_{j=1}^{n_l} w_j k(t_i, t_j)\, x_l(t_j) = y(t_i), \qquad i = 1, \ldots, n_l. \qquad (5)$$

We will denote (1) and (5) symbolically as

$$(\lambda - K)x = y \qquad (6)$$

and

$$(\lambda - K_l)x_l = y, \qquad (7)$$

respectively. Our discussion is based on the convergence of a sequence of approximations to the
unique solution of (1).


In finding numerical solutions of equation (1), the system (5) may be too large to be solved
directly. The purpose of this paper is to consider some iterative variants of (4). The basic
assumptions needed in our algorithms are given in section 2. In section 3, linear iterative
algorithms are given, based on Simpson's rule and the piecewise Gaussian quadrature rule for the
numerical integration formulae. In section 4, we include numerical examples.


BASIC ASSUMPTIONS 


The methods will be defined and discussed using the abstract formulation of Anselone [1] and
Atkinson [3], [4] for families of collectively compact operators.

Let $X_l$, $l = 0, 1, 2, \ldots$, be finite-dimensional subspaces of the Banach space $X$ and let
$P_l$, $l = 0, 1, 2, \ldots$, be bounded projection operators from $X$ onto $X_l$. We need the following
assumptions for $\{X_l\}$ and $\{P_l\}$:


(A1) $X_0 \subset X_1 \subset \cdots \subset X_l \subset \cdots \subset X$
(A2) $\lim_{l \to \infty} \|f - P_l f\| = 0$ for all $f \in X$
The sequence $\{X_l\}$ is thought of as being associated with a sequence of decreasing mesh sizes $\{h_l\}$
with $\lim_{l \to \infty} h_l = 0$. Corresponding to this sequence $\{h_l\}$, we approximate $K$ by a sequence of
operators $\{K_l\}$, $K_l : X \to X$. In multigrid iteration, the subscript $l$ is called the "level". The
hypotheses on $\{K_l : l \ge 1\}$ and $K$ are as follows.

(A3) $K$ and $K_l$, $l \ge 1$, are linear operators on the Banach space $X$ into $X$.

(A4) $K_l x \to K x$ as $l \to \infty$, for all $x \in X$.

(A5) $\{K_l\}$ is a collectively compact family of operators.

The following is a consequence of the assumptions (A3)-(A5):

Lemma 1 Assume (A3)-(A5). Then, with $n_l$ defined as in (3),

(i) $K$ is compact;

(ii) $\|(K - K_l)K\|$ and $\|(K - K_l)K_l\|$ converge to zero as $l \to \infty$;

(iii) if $\alpha_l = \sup_{m \ge l} \sup_{n \ge 1} \|(K - K_m)K_n\|$, then $\lim_{l \to \infty} \alpha_l = 0$.

Proof. See Atkinson [4].

Lemma 2 If $(\lambda - K)^{-1}$ exists, then $(\lambda - K_l)^{-1}$ exists for sufficiently large $l$, say $l \ge N(\lambda)$, and is uniformly bounded by $c_2(\lambda)$, and

$$\|x - x_l\| \le c_2(\lambda)\, \|Kx - K_l x\|, \qquad l \ge N(\lambda),$$

where $x_l = (\lambda - K_l)^{-1} y$.

Proof. See Atkinson [4].

This shows that $x_l \to x$ and gives a rate of convergence.

LINEAR ITERATIVE METHODS 


Multigrid Methods 


Assume that $x_{l,0}$ denotes an approximate solution of (7) with residual

$$d_l = y - (\lambda - K_l)x_{l,0}. \qquad (8)$$

Then we improve the accuracy by writing

$$x_{l,1} = x_{l,0} + \delta_l, \qquad (9)$$

where the correction $\delta_l$ satisfies the residual correction equation

$$(\lambda - K_l)\delta_l = d_l. \qquad (10)$$


In general, the correction term $\delta_l$ will be small, and it is unnecessary to solve the residual
correction equation (10) exactly. Thus we may write

$$\delta_l = B_l d_l, \qquad (11)$$

where $B_l$ denotes a bounded linear operator approximating $(\lambda - K_l)^{-1}$. By (8) and (9) together
with (11), we obtain

$$x_{l,1} = [I - B_l(\lambda - K_l)]\, x_{l,0} + B_l y \qquad (12)$$

as the new approximate solution to (7). Equation (11) can be represented well by means of
coarser-grid functions:

$$(\lambda - K_{l-1})\delta_{l-1} = d_{l-1}, \qquad (13)$$

where $d_{l-1}$ is chosen reasonably and depends linearly on $d_l$. If $r : X_l \to X_{l-1}$ is the restriction
mapping, then

$$d_{l-1} = r\, d_l. \qquad (14)$$

Having defined $d_{l-1}$ by (14), $\delta_{l-1}$ is obtained using (11) at level $l - 1$. Having obtained $\delta_{l-1}$, which
is defined only on the coarse grid level, we need to interpolate this coarse-grid function by

$$\delta_l = p\, \delta_{l-1}, \qquad (15)$$

where $p$ describes the prolongation of a coarse-grid function to a fine-grid function.

We note here that the prolongation $p$ in (15) must be chosen so that

$$\|I - pr\| \le C h_l^{\tau}, \qquad (16)$$

where the consistency order $\tau$ depends on the discretization (e.g. on the order of the quadrature
formula). For the restriction operator $r$, we will consider both trivial injection and Nystrom-type
restriction.

Our automatic algorithm is based on the following multigrid iteration, which is given as a
recursive procedure.


Multigrid iteration for solving $(\lambda - K_l)x_l = y$

Procedure Multigrid$(l, x_l, y)$  (17)

    if $l = 0$ then
        solve $x_0 = (\lambda - K_0)^{-1} y$
    otherwise
        $x_l := \frac{1}{\lambda}[K_l x_l + y]$
        $d_l := (\lambda - K_l)x_l - y$
        $d_{l-1} := r\, d_l$
        repeat the Procedure Multigrid with $(l - 1, \delta_{l-1}, d_{l-1})$
        $x_l^{\text{new}} := x_l - p\, \delta_{l-1}$
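The following Python sketch mirrors procedure (17) for equation (1) on $D = [0, 1]$, under several assumptions that the text does not fix: composite Simpson's rule with $2^{l+1} + 1$ nodes at level $l$, a zero initial guess for the coarse-grid correction, trivial injection for $r$, and a Nystrom-type interpolation of the correction for $p$ (using $\delta = (K\delta + d)/\lambda$ with the coarse quadrature). The kernel $\cos(\pi s t)$, $\lambda = 1$ and $y(s) = e^s$ are illustrative choices, and all function names are hypothetical. The script compares the multigrid iterates with a direct solve of the finest Nystrom system (5).

```python
import numpy as np

def simpson_rule(level):
    """Composite Simpson rule on [0, 1] with 2**(level+1) + 1 nodes."""
    n = 2 ** (level + 1)                       # number of subintervals (even)
    t = np.linspace(0.0, 1.0, n + 1)
    w = np.ones(n + 1)
    w[1:-1:2] = 4.0
    w[2:-1:2] = 2.0
    return t, w / (3.0 * n)

def nystrom_matrix(kernel, s, t, w):
    """Matrix of (K x)(s_i) = sum_j w_j k(s_i, t_j) x(t_j)."""
    return kernel(s[:, None], t[None, :]) * w[None, :]

def multigrid(l, x, y, lam, kernel):
    """One cycle of a procedure like (17) for (lam - K_l) x_l = y at level l."""
    t, w = simpson_rule(l)
    K = nystrom_matrix(kernel, t, t, w)
    if l == 0:
        return np.linalg.solve(lam * np.eye(t.size) - K, y)   # direct solve
    x = (K @ x + y) / lam                      # Picard smoothing step
    d = lam * x - K @ x - y                    # d_l = (lam - K_l) x_l - y
    d_c = d[::2]                               # r: injection onto coarse nodes
    delta_c = multigrid(l - 1, np.zeros_like(d_c), d_c, lam, kernel)
    tc, wc = simpson_rule(l - 1)
    # p: Nystrom-type interpolation of the coarse correction to the fine nodes
    delta = (nystrom_matrix(kernel, t, tc, wc) @ delta_c + d) / lam
    return x - delta

def kernel(s, t):
    return np.cos(np.pi * s * t)

lam, L = 1.0, 5
t, w = simpson_rule(L)
y = np.exp(t)
x_direct = np.linalg.solve(lam * np.eye(t.size) - nystrom_matrix(kernel, t, t, w), y)
x = np.zeros_like(y)
for cycle in range(20):
    x = multigrid(L, x, y, lam, kernel)
print(np.max(np.abs(x - x_direct)))
```

Because the Picard step leaves a smooth error (it is an image of the compact operator $K_l$), the coarse grid can represent it well, which is why the sign convention $x_l^{\text{new}} = x_l - p\,\delta_{l-1}$ with $d_l = (\lambda - K_l)x_l - y$ converges quickly here.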

We now give some basic results on the multigrid algorithm (17) that are used in our automatic
algorithm.


Let $\zeta_k$ be the contraction number of the multigrid iteration employed at level $k$:

$$\|x_k^{j+1} - x_k\| \le \zeta_k \|x_k^j - x_k\|. \qquad (18)$$

Then it is known that the $\{\zeta_k\}$ are uniformly bounded by some $\zeta < 1$. Let

$$\zeta := \max_{1 \le k \le l} \zeta_k, \qquad (19)$$

where $l$ is the maximum level in (17). The relative discretization error, the difference between $x_k$
and $x_{k-1}$, is often estimated by

$$\|p\, x_{k-1} - x_k\| \le C_1 h_k^{\tau} \quad \text{for } 1 \le k \le l, \qquad (20)$$

where $p$ is a prolongation operator and $\tau$ is the consistency order.


Theorem 3 Assume (20) and

$$C_2\, \zeta^i < 1 \qquad (21)$$

with

$$C_2 := \max_{1 \le k \le l} \Big(\frac{h_{k-1}}{h_k}\Big)^{\tau};$$

then the $i$-th iteration of the multigrid procedure (17) at level $k$ results in $\tilde{x}_k$, which satisfies the
error estimate

$$\|\tilde{x}_k - x_k\| \le C_3 C_1 h_k^{\tau} \quad \text{for } 0 \le k \le l, \qquad (22)$$

where

$$C_3 = \frac{\zeta^i}{1 - C_2 \zeta^i}. \qquad (23)$$

Proof. See Hackbusch [11].


Theorem 4 Assume the validity of (22) and suppose | then the i th iteration of the 

multigrid procedure (17) at level k results in x satisfies the error estimate 


k ^Cfc 0 4 \\Xk ^ | 


where 


r _ (2 T -IX' 

C --T^c r 


(24) 

(25) 


Proof. See Hackbusch [11].


Automatic Algorithms 


The contraction number $\zeta_k$ in (18) is used to estimate the iteration error; together
with the discretization error, the global error in the solution is then estimated. Often $\zeta_k$ is estimated by

$$\zeta_k \approx \frac{\|x_k^{j+1} - x_k^j\|}{\|x_k^j - x_k^{j-1}\|}. \qquad (26)$$

Then

$$\|x_k - x_k^{j+1}\| \le \frac{\zeta_k}{1 - \zeta_k}\, \|x_k^{j+1} - x_k^j\| \qquad (27)$$

is used to estimate the iteration error. Thus at any level a minimum of two iterations is required
to estimate the iteration error. However, (24) together with (25) can be used to estimate $\zeta$ using


C 4 


iteration error 


discretization error 
and it will enable us to estimate (27) with only one iteration. 


(28) 
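Estimates (26) and (27) need only the last three iterates at a level. A minimal sketch, assuming the ratio form of (26) and the geometric bound (27): the synthetic iterates below converge with a prescribed rate 0.1, so the estimated contraction number and the estimated iteration error can be compared with the true values. The data and the function names are hypothetical.

```python
import numpy as np

def contraction_estimate(x_jm1, x_j, x_jp1):
    """zeta_k ~ ||x^{j+1} - x^j|| / ||x^j - x^{j-1}||, as in (26)."""
    return np.linalg.norm(x_jp1 - x_j) / np.linalg.norm(x_j - x_jm1)

def iteration_error_estimate(zeta, x_j, x_jp1):
    """Bound ||x_k - x^{j+1}|| <= zeta / (1 - zeta) ||x^{j+1} - x^j||, as in (27)."""
    return zeta / (1.0 - zeta) * np.linalg.norm(x_jp1 - x_j)

# Hypothetical iterates converging geometrically with rate 0.1.
x_true = np.ones(5)
v = np.arange(5.0)
iterates = [x_true + 0.1 ** j * v for j in range(4)]
zeta = contraction_estimate(*iterates[1:4])
est = iteration_error_estimate(zeta, iterates[2], iterates[3])
true = np.linalg.norm(iterates[3] - x_true)
print(zeta, est, true)
```

For an exactly geometric sequence the bound (27) is attained; for real multigrid iterates it is only an estimate, which is why the automatic algorithm also monitors the discretization error.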


Our first algorithm is based on Simpson's rule, with the number of node points doubled as the level
increases, i.e. the dimension of the linear system at level $l$ is $2^{l+1} + 1$. In this case we have $C_2 = 16$
in (21). Thus, by condition (21), if $\zeta < 1/16$ the estimate (22) holds with $i = 1$, i.e. only one
multigrid iteration per level is needed. The result is computational savings. As the level increases, the
amount of computation increases, so there is a significant time saving in performing only one
iteration as the dimension of the linear system being solved becomes larger. Moreover, $\zeta_k$ in (18)
goes to zero as the level $k$ increases, which means that beyond a certain level $k$, $\zeta_k$ becomes so small
that the iteration error becomes much less significant than the discretization error; hence a more
accurate estimate of it is not needed. Thus one iteration is sufficient at this stage.
The second algorithm is based on the piecewise Gaussian quadrature rule for the numerical
integration scheme. We adapt the iteration error estimation scheme discussed earlier.

For simplicity we use $h_l = (b - a)/2^l$ for $l = 1, 2, \ldots$. This means that we halve the length of each
subinterval as the level increases. Suppose that at some level $l$ we have a partition

$$Q_l = \{a = q_0 < q_1 < \cdots < q_{m_l} = b\} \qquad (29)$$

with

$$q_i = a + i\, h_l \quad \text{for } i = 0, 1, 2, \ldots, m_l,$$

and $m_l = 2^l$ := number of subintervals, for $l = 0, 1, 2, \ldots$.

Then

$$\int_a^b f(t)\, dt \approx \sum_{i=1}^{m_l} h_l \sum_{j=1}^{p} w_j f(q_{i-1} + h_l t_j), \qquad (30)$$

where

$$\int_0^1 f(t)\, dt \approx \sum_{j=1}^{p} w_j f(t_j) \qquad (31)$$

is the Gaussian quadrature rule on [0,1] with $p$ node points.
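A sketch of the composite rule (30), built from the Gauss-Legendre rule on [-1, 1] provided by NumPy's `numpy.polynomial.legendre.leggauss` and mapped to [0, 1] as in (31). The function name `composite_gauss` is hypothetical. The checks rely on two standard facts: a $p$-point Gauss rule integrates polynomials of degree $2p - 1$ exactly, and the composite 2-point rule has error $O(h_l^4)$.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss

def composite_gauss(a, b, level, p):
    """p-point Gauss-Legendre rule on each of the 2**level subintervals of [a, b]."""
    m = 2 ** level
    h = (b - a) / m
    x, wx = leggauss(p)                        # nodes and weights on [-1, 1]
    t_ref, w_ref = (x + 1.0) / 2.0, wx / 2.0   # rule (31) on [0, 1]
    q = a + h * np.arange(m)                   # left endpoints q_0, ..., q_{m-1}
    nodes = (q[:, None] + h * t_ref[None, :]).ravel()
    weights = np.tile(h * w_ref, m)            # h * w_j, as in (30)
    return nodes, weights

nodes, weights = composite_gauss(0.0, 1.0, 3, 3)
print(weights @ nodes**5)                      # degree 5 = 2p - 1: exactly 1/6
nodes2, weights2 = composite_gauss(0.0, 1.0, 4, 2)
print(weights2 @ np.exp(nodes2) - (np.e - 1.0))
```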

Unlike Simpson's rule, the piecewise Gaussian rule does not have nested node points. In the following algorithm, both
restriction and prolongation are done with Nystrom-type interpolation.

Procedure Multigrid with piecewise Gaussian$(l, x_l, y)$  (32)

    if $l = 0$ then
        solve $x_0 = (\lambda - K_0)^{-1} y$
    otherwise
        $\bar{x}_l := x_l$
        $x_l := \frac{1}{\lambda}[K_l \bar{x}_l + y]$
        $d_l := (\lambda - K_l)x_l - y = K_l \bar{x}_l - K_l x_l$
        $d_{l-1} := r(K_l \bar{x}_l - K_l x_l)$
        repeat the Procedure Multigrid with $(l - 1, \delta_{l-1}, d_{l-1})$
        $x_l^{\text{new}} := x_l - p\, \delta_{l-1}$

Nystrom-type interpolations as in procedure (32) are costly: each interpolation involves
$O(n_l^2)$ multiplications at each level. However, this can be improved, as suggested in our conclusion
later.


The following theorem, which is due to Atkinson and Potra [7], gives the theoretical rate of
convergence for piecewise Gaussian quadrature with Nystrom-type interpolation. We will assume
that the kernel $k(s,t)$ belongs to the class $\mathcal{G}(\alpha, \gamma)$. This means that the kernel $k(s,t)$ has the
following properties:

(G1) Define

$$\Psi_1 = \{(s,t) \mid a \le s \le t \le b\}, \qquad \Psi_2 = \{(s,t) \mid a \le t \le s \le b\}.$$

Then there are functions $k_i \in C^{\alpha}(\Psi_i)$, $i = 1, 2$, with

$$k(s,t) = k_1(s,t), \ (s,t) \in \Psi_1, \qquad k(s,t) = k_2(s,t), \ (s,t) \in \Psi_2.$$

(G2) If $\gamma \ge 0$, then $k(s,t) \in C^{\gamma}([a,b] \times [a,b])$. If $\gamma = -1$, then the kernel $k(s,t)$ may have a
discontinuity of the first kind along the line $t = s$.

Theorem 5 Assume that $k(s,t) \in \mathcal{G}(\alpha, \gamma)$. Solve the Nystrom equation

$$x_l(s) = \sum_{j=1}^{n_l} w_j k(s, t_j)\, x_l(t_j) + y(s) \qquad (33)$$

using the piecewise Gaussian quadrature rule with $p$ node points in each subinterval, by first
obtaining $x_l(t_1), \ldots, x_l(t_{n_l})$ as the solution of the linear system

$$x_l(t_i) = \sum_{j=1}^{n_l} w_j k(t_i, t_j)\, x_l(t_j) + y(t_i), \qquad i = 1, \ldots, n_l, \qquad (34)$$

and then using (33) as an interpolation formula. This gives the error estimate

$$\|x - x_l\| = O(h_l^{w}), \qquad (35)$$

where $w = \min\{\alpha, 2p, \gamma + 2\}$.


Proof. See Atkinson and Potra [7] for the case $p = r + 1$.

Finally, to determine $i$, the needed number of iterations at any level $l$, use (24) and (25) with
$\tau = 2p$; hence $C_2 = 2^{2p}$.
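Theorem 5 combines the linear system (34) for the node values with the interpolation formula (33). The minimal sketch below does exactly that for the assumed smooth kernel $\cos(\pi s t)$ and right-hand side $y(s) = e^s$ on [0, 1], using the composite Gauss rule with $p = 2$. For such a smooth kernel, (35) predicts an error of order $h_l^{2p} = h_l^4$, so differences between successive levels should shrink by a factor of roughly 16; the code only checks this empirically, and all names are hypothetical.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss

def composite_gauss(level, p):
    """Composite p-point Gauss rule on [0, 1] with 2**level subintervals."""
    m = 2 ** level
    h = 1.0 / m
    x, wx = leggauss(p)
    t_ref, w_ref = (x + 1.0) / 2.0, wx / 2.0
    nodes = (h * np.arange(m)[:, None] + h * t_ref[None, :]).ravel()
    return nodes, np.tile(h * w_ref, m)

def nystrom_solution(level, p, kernel, y):
    """Solve (34) for the node values; return them and the Nystrom interpolant (33)."""
    t, w = composite_gauss(level, p)
    A = np.eye(t.size) - kernel(t[:, None], t[None, :]) * w[None, :]
    xt = np.linalg.solve(A, y(t))
    def x_l(s):
        s = np.atleast_1d(s)
        return (kernel(s[:, None], t[None, :]) * w[None, :]) @ xt + y(s)
    return t, xt, x_l

def kernel(s, t):
    return np.cos(np.pi * s * t)

y = np.exp
s_eval = np.linspace(0.0, 1.0, 101)
vals = [nystrom_solution(l, 2, kernel, y)[2](s_eval) for l in (2, 3, 4)]
d23 = np.max(np.abs(vals[0] - vals[1]))
d34 = np.max(np.abs(vals[1] - vals[2]))
print(d23 / d34)   # expected to be near 2**(2p) = 16 in the asymptotic regime
```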


Automatic Implementation 


Our automatic implementation is divided into two stages based on the results of the
iteration method. In stage 1, $(\lambda - K_m)x_m = y$ is solved directly, and then an attempt is made to
solve $(\lambda - K_l)x_l = y$ for $l > m$ iteratively. If the rate of convergence is sufficiently rapid, then
stage 2 is entered. Otherwise $m$ is replaced by $l$ and stage 1 is repeated. In stage 2, the
value of $m$ serves as the coarsest grid level in the multigrid procedure (17), which solves
$(\lambda - K_l)x_l = y$ iteratively until termination of the algorithm. The iteration procedure attempts to
use the minimum number of iterates: once the iterative solutions satisfy certain criteria,
we estimate the rate of convergence asymptotically, which enables the rate of convergence to be
estimated with only one iteration per level. As shown in our numerical examples, this
scheme results in computational savings at finer grid levels.

The initial guess for an iteration at the next higher level is the interpolation of the solution at the
preceding level, which may have been obtained either directly or iteratively. The errors $\|x - x_m\|$
and $\|x - x_l\|$ in stages 1 and 2, respectively, are monitored continuously, regardless of whether the
iteration method is being used. Thus an answer within the desired error tolerance may be obtained
before the multigrid iteration has been invoked successfully.

In order to estimate the global error in the current solution, we need to monitor the
discretization error and the iteration error. For the iteration error estimation, (27) is used with the
estimated $\zeta$ in place of $\zeta_k$. In stage 1, a test is made to determine whether the speed of
convergence is sufficient to enter stage 2. If

$$\zeta \le [\text{Ratio}]^{1/2}, \qquad (36)$$

then the speed of convergence is adequate for stage 2. This requirement will usually ensure that
only two iterates need to be calculated in stage 2 at any given level. The number Ratio is
the theoretical rate at which the error in $x_l$ should decrease when $l$ is increased to the next level.
In our case, since we double the number of node points as the level increases, Ratio $= 1/2^{\tau}$, with $\tau = 4$
for Simpson's rule and $\tau = 2p$ for the $p$-point piecewise Gaussian quadrature in each subinterval.

For the discretization error estimation, we compute the rate at which the error is decreasing
at the current level. For each computed level $l$, let

$$\text{NumDE} := \|x_l - x_{l-1}\| \qquad (37)$$

and let DenDE be the previous value of NumDE, if any. Then the rate is computed using

$$\text{DE} := \frac{\text{NumDE}}{\text{DenDE}}. \qquad (38)$$

Using this value of DE, we estimate the error $x - x_l$ by

$$\text{Error} := \frac{\text{DE}}{1 - \text{DE}}\, \text{NumDE}, \qquad (39)$$

which is a standard error estimate for sequences that converge geometrically with rate
DE. Having estimated Error as in (39), we use the final test

$$\text{Error} \le \epsilon, \qquad (40)$$

with $\epsilon$ a desired error tolerance supplied by the user.

To ensure that only the needed accuracy in $x_l$ is computed, we want to test

$$\text{iteration error} \le \text{quadrature error}. \qquad (41)$$
This is done by

$$\frac{\zeta}{1 - \zeta}\, \|x_l^{(2)} - x_l^{(1)}\| \le \frac{\text{DE}}{1 - \text{DE}}\, \|x_l^{(2)} - x_l^{(0)}\|. \qquad (42)$$

The test (42) is obtained by using (41) and the approximations

$$\|x - x_l\| \approx \frac{\text{DE}}{1 - \text{DE}}\, \|x_l^{(2)} - x_l^{(0)}\|, \qquad (43)$$

$$\|x_l - x_l^{(2)}\| \approx \frac{\zeta}{1 - \zeta}\, \|x_l^{(2)} - x_l^{(1)}\|. \qquad (44)$$

If the test (42) is not satisfied, then a new iterate is calculated and (42) is tested again.
Once an iterate is acceptable according to (42), we check the accuracy of the most recently
computed iterate using (39) and (40).
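The discretization-error estimate (37)-(39) takes only a few lines. This is a minimal illustration, assuming the approximations from successive levels have already been sampled at a common set of points (for example via the Nystrom interpolation formula) and that the maximum norm is used. The data are synthetic, with errors decaying exactly like $16^{-l}$ as for Simpson's rule ($\tau = 4$), so the estimate should reproduce the true error; all names are hypothetical.

```python
import numpy as np

def discretization_error_estimate(x_lm2, x_lm1, x_l):
    """Estimate ||x - x_l|| from three successive levels, following (37)-(39).

    All three approximations must be sampled at a common set of points.
    """
    num_de = np.max(np.abs(x_l - x_lm1))        # NumDE, (37)
    den_de = np.max(np.abs(x_lm1 - x_lm2))      # DenDE: previous value of NumDE
    de = num_de / den_de                        # DE, (38)
    return de, de / (1.0 - de) * num_de         # Error, (39)

# Hypothetical data: approximations converging like 16**(-l).
s = np.linspace(0.0, 1.0, 11)
x = np.exp(s) * np.cos(7.0 * s)                 # stands in for the exact solution
g = np.sin(3.0 * s) + 2.0                       # hypothetical (positive) error profile
x_levels = [x + 16.0 ** (-l) * g for l in (3, 4, 5)]
de, error = discretization_error_estimate(*x_levels)
true_error = np.max(np.abs(x - x_levels[-1]))
print(de, error, true_error)
```

The returned Error is the quantity compared with the user tolerance in the final test (40).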


NUMERICAL EXAMPLES 


The integral equation

$$x(s) - \lambda \int_a^b k(s,t)\, x(t)\, dt = y(s), \qquad a \le s \le b, \qquad (45)$$

is solved with the kernel

$$k(s,t) = \cos(\pi s t)$$

on [0,1]. A variety of parameters $\lambda$ close to the dominant characteristic value (the
reciprocal of the dominant eigenvalue) are considered, since the equation becomes more difficult to solve as $\lambda$
approaches a characteristic value. The dominant characteristic value in our example is
1.4278. The right-hand function $y(s)$ is chosen so that

$$x(s) = e^s \cos(7s), \qquad 0 \le s \le 1. \qquad (46)$$


Table I. The First Algorithm

                                        Dimension (Level)
λ      Desired   Estimated   Actual     Coarsest   Finest
1.00   1.0E-6    6.82E-7     6.76E-7    3 (0)      65 (5)
1.40   1.0E-4    1.62E-5     1.60E-5    5 (1)      65 (5)
1.43   1.0E-4    1.31E-5     1.31E-5    5 (1)      129 (6)
In Table I, the Estimated column is computed using (39). As $\lambda$ approaches the characteristic
value 1.4278, both the coarsest grid level and the finest grid level increased. In Table II
we give the iterative rate of convergence at each level; the number of iterations performed at
each level is given in parentheses. As noted in section 3, only one iteration is needed as the
level increases. Whenever only one iteration is performed at a given level, the iterative rate of
convergence is the maximum contraction number $\zeta$ in (19), estimated using (24) and (25).


Table II. Iterative Rate of Convergence of the First Algorithm

                 Level
λ      Desired   1            2            3            4            5            6
1.00   1.0E-6    2.10E-2 (2)  5.14E-2 (1)  2.03E-3 (1)  3.40E-3 (1)  3.80E-3 (1)  -
1.40   1.0E-4    2.10E-1 (2)  5.31E-2 (2)  7.57E-3 (2)  5.93E-2 (1)  3.79E-3 (1)  -
1.43   1.0E-4    -            1.44E-1 (2)  1.44E-2 (2)  4.40E-2 (1)  3.75E-3 (1)  3.89E-3 (1)


For the second algorithm, the coarsest level corresponds to two subintervals. In order to give a 
reasonable comparison with the first algorithm, we first give the results with 2 node points in each 
subinterval. Thus the quadrature order coincides with that of the first algorithm. 


Table III. The Second Algorithm with p = 2

                                        Dimension (Level)
λ      Desired   Estimated   Actual     Coarsest   Finest
1.00   1.0E-6    6.82E-7     6.76E-7    4 (0)      64 (5)
1.40   1.0E-4    1.62E-5     1.60E-5    4 (0)      64 (5)
1.43   1.0E-5    8.74E-6     8.72E-6    4 (0)      128 (6)


In the next table, we give results from the second algorithm with more node points in each
subinterval. To show the superiority of the Gaussian quadrature rule, we give results for a smaller
desired error for $\lambda = 1.40$ and $\lambda = 1.43$.


Table IV. The Second Algorithm with p = 3, 4

                                             Dimension (Level)
λ      p   Desired   Estimated   Actual      Coarsest   Finest
1.40   3   1.0E-8    1.95E-9     1.93E-9     6 (0)      96 (4)
1.43   3   1.0E-8    3.97E-10    3.96E-10    6 (0)      192 (5)
1.43   4   1.0E-8    6.52E-10    6.28E-10    8 (0)      64 (3)


Table V. Iterative Rate of Convergence of the Second Algorithm with p = 3, 4

                     Level
λ      p   Desired   1            2            3            4            5
1.40   3   1.0E-8    1.09E-4 (2)  1.43E-6 (1)  8.84E-4 (1)  9.90E-4 (1)  -
1.43   3   1.0E-8    1.39E-3 (2)  1.04E-2 (1)  2.10E-4 (1)  2.36E-4 (1)  2.42E-4 (1)
1.43   4   1.0E-8    9.35E-6 (2)  2.55E-3 (1)  1.30E-5 (1)  -            -



CONCLUSION 


The piecewise Gaussian rule is superior to Simpson's rule. However, as pointed out in section
3, restrictions and prolongations are done with Nystrom-type interpolation, which involves $O(n_l^2)$
multiplications at each level $l$, not counting kernel evaluations. It appears that these
operations are the bottleneck of our algorithms. We are in the process of applying the idea
suggested by Achi Brandt in [9] to our current algorithms, which will reduce the operation count
considerably. Our preliminary results appear to be promising, and progress is being made in developing
them further.
SUMMARY

Self-adaptive mesh refinement dynamically matches the computational demands of a solver for
partial differential equations to the activity in the application's domain. In this paper we present
two C++ class libraries, P++ and AMR++, which significantly simplify the development of
sophisticated adaptive mesh refinement codes on (massively) parallel distributed memory
architectures. The development is based on our previous research in this area. The C++ class
libraries provide abstractions to separate the issues of developing parallel adaptive mesh refinement
applications into those of parallelism, abstracted by P++, and adaptive mesh refinement,
abstracted by AMR++. P++ is a parallel array class library to permit efficient development of
architecture-independent codes for structured grid applications, and AMR++ provides support for
self-adaptive mesh refinement on block-structured grids of rectangular non-overlapping blocks.

Using these libraries, the application programmer's work is greatly simplified: it consists primarily of
specifying the serial single-grid application, and the parallel and self-adaptive mesh refinement code
is obtained with minimal effort.

Initial results are presented for simple singular perturbation problems solved by self-adaptive multilevel techniques (FAC, AFAC) that have been implemented on the basis of prototypes of the P++/AMR++ environment. Singular perturbation problems frequently arise in large applications, e.g. in the area of computational fluid dynamics. Their solutions usually have layers that require adaptive mesh refinement and fast basic solvers in order to be resolved efficiently.

INTRODUCTION 

The purpose of local mesh refinement during the solution of partial differential equations 
(PDEs) is to match computational demands to an application’s activity: In a fluid flow problem this 
means that only regions of high local activity (shocks, boundary layers, etc.) can demand increased 
computational effort; regions of little flow activity (or interest) are more easily solved using only 
relatively little computational effort. In addition, the ability to adaptively tailor the computational 
mesh to the changing requirements of the application problem at runtime (e.g. moving fronts in 
time dependent problems) provides for much faster solution methods than static refinement or even 
uniform grid methods. Combined with increasingly powerful parallel computers that are becoming 
available, such methods allow for much larger and more comprehensive applications to be run. With 
local refinement methods, the greater disparity of scale introduced in larger applications can be addressed locally. Without local refinement, the resolution of smaller features in the application's domain can impose global limits on either the mesh size or the time step. The increased
computational work associated with processing the global mesh cannot be readily offset even by the 
increased computational power of advanced parallel computers. Thus, local refinement is a natural 
part of the use of advanced massively parallel computers to process larger and more comprehensive 
applications. 


1 Revised and shortened version of [10]. This research has been supported by the National Aeronautics and Space Administration under grant number NAS1-18606 and the German Federal Ministry of Research and Technology (BMFT) under PARANUSS, grant number ITR 900689.

2 Part of this work belongs to the author’s dissertation.
Our experiments with different local refinement algorithms for the solution of the simple potential flow equation on parallel distributed memory architectures (e.g. [8]) demonstrate that, with the correct choice of solvers, the performance of local refinement codes shows no significant sign of degradation as more processors are used. Contrary to conventional wisdom, the fundamental properties of our adaptive mesh refinement methods do not conflict with the requirements for efficient vectorization and parallelization. However, the best choice of numerical algorithm depends strongly on several factors. These are its parallelization capabilities, the specific application problem and its adaptive grid structure, and, last but not least, the target architecture's performance parameters. Algorithms that are expensive on serial and vector architectures but highly parallelizable can be superior on one or several classes of parallel architectures.

Our previous work with parallel local refinement was done in the C language to allow better access to dynamic memory management. It has permitted only simplified application problems on non-block-structured composite grids of rectangular patches. The work was complicated by the numerical properties of local refinement, including self-adaptivity, and by their parallelization, for example static and dynamic load balancing. In particular, the explicit introduction of parallelism in the application code is very cumbersome. Software tools for simplifying this are not available. For example, existing grid-oriented communication libraries (as used in [6]) are far too restrictive to be applied efficiently to this kind of dynamic problem.

Extending this code to more general, complex fluid flow problems on complicated block structured grids is therefore limited by a software engineering problem. That problem is managing the large combined complexity of the application problem, the numerical treatment of self-adaptive mesh refinement, complicated grid structures, and explicit parallelization. Under these conditions, it is impossible to develop codes that are portable across different target architectures and applicable not just to one problem and algorithm but to a larger class.

Our solution to this software difficulty uses abstractions as a means of handling the combined complexities of adaptivity, mesh refinement, the application-specific algorithm, and parallelism. These abstractions greatly simplify the development of algorithms and codes for complex applications. For example, the abstraction of parallelism permits application codes to be developed in the simplified serial environment and the same code to be executed in a massively parallel distributed memory environment. Such codes must necessarily be based on parallel algorithms, as opposed to serial algorithms whose data and computation structures do not allow parallelization.

This paper introduces an innovative set of software tools to simplify the development of parallel adaptive mesh refinement codes for difficult algorithms. The tools are presented in two parts, both of which are C++ class libraries, and allow for the management of the great complexities described above. The first class library, P++ (short summary in Section 2, details in [10]), forms a data parallel superset of the C++ language together with the commercial C++ array class library M++ (Dyad Software Corporation). A standard C++ compiler is used; no modifications of the compiler are required.

The second set of class libraries, AMR++ (Section 3), forms a superset of the C++/M++, or P++, environment and further specializes the combined environment for local refinement (or parallel local refinement). In Section 4 we introduce multilevel algorithms that allow for the introduction of adaptive mesh refinement ((Asynchronous) Fast Adaptive Composite Methods (FAC and AFAC)). In Section 5, we present first results for a simple singular perturbation problem that has been solved using FAC and AFAC algorithms implemented on the basis of the AMR++ and P++ prototypes. This problem serves as a good model problem for complex fluid flow applications, because several of the properties related to self-adaptive mesh refinement are already present in it.

We are particularly grateful to Steve McCormick, without whose support this joint work would not have been possible. We also thank the people at the Federal German Research Center Jülich (KFA) for their generous support in letting us use their iPSC/860 environment. In addition, we would like to thank everybody who discussed P++ or AMR++ with us or supported our work in any other way.
P++, A PARALLEL ARRAY CLASS LIBRARY FOR STRUCTURED GRIDS

P++ is an innovative, robust, and architecture-independent array class library that simplifies the development of efficient parallel programs for large scale scientific applications by abstracting parallelism. The target machines are current and evolving massively parallel distributed memory multiprocessor systems (e.g. Intel iPSC/860 and PARAGON, Connection Machine 5, Cray MPP, IBM RS 6000 networks) with different types of node architectures (scalar, vector, or superscalar). Through the use of portable communication and tool libraries (e.g. EXPRESS, ParaSoft Corporation), the requirements of shared memory computers are also addressed. The P++ parallel array class library is implemented in standard C++ using the serial M++ array class library, with absolutely no modification of the compiler. P++ allows software to be developed in the preferred serial environment and then run efficiently and unchanged in all target environments. The runtime support for parallelism is both completely hidden and dynamic, so that array partitions need not be fixed during execution. The added degree of freedom presented by parallel processing is exploited by an optimization module within the array class interface. For details, please refer to [10].

Application class: The P++ application class is currently restricted to structured grid-oriented problems, which form a primary problem class in current scientific supercomputing. This class is represented by dimensionally independent block structured grids (1D to 4D) with rectangular or logically rectangular grid blocks. The M++ array interface is also used as the P++ interface, and its functionality is similar to the array features of Fortran 90. It is particularly well suited to express operations on grid blocks to the compiler and to the P++ environment at runtime.

Programming Model and Parallelism: P++ is based on a Single Program Multiple Data Stream (SPMD) programming model, which consists of executing one single program source on all nodes of the parallel system. This model is combined with the Virtual Shared Grids (VSG) model of data parallelism, a restriction of virtual shared memory to structured grids whose communication is controlled at runtime. The combination is essential for representing the parallel program simply by the serial program and for hiding communication within the grid block classes.

Besides different grid partitioning strategies, two communication update principles are provided and automatically selected at runtime: Overlap Update for very efficient nearest neighbor grid element access of aligned data, and VSG Update for general grid (array) computations. By use of local partitioning tables, communication patterns are derived at runtime. P++ then automatically generates the appropriate send and receive messages of grid portions, selecting the most efficient communication model for each operation. As opposed to general Virtual Shared Memory implementations, VSG achieves parallel performance similar to that of codes based on the traditionally used explicit Message Passing programming model. Control-flow-oriented functional parallelism is not yet specifically supported in P++. However, a cooperation with the developers of CC++ ([4]) is planned.
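The Overlap Update principle can be illustrated without parallel hardware. The following Python sketch simulates SPMD execution sequentially:

- A 1D grid is split into contiguous blocks, each padded with one ghost (overlap) cell per side.
- The ghost cells are refreshed from the neighboring blocks. This stands in for the send/receive messages that P++ would generate.
- The same Jacobi stencil as in the serial code is then applied block by block.

All function names are hypothetical and are not part of the P++ interface. The sketch only shows that the gathered result matches the serial sweep exactly.

```python
from typing import List, Tuple

def serial_jacobi(u: List[float]) -> List[float]:
    """Serial reference: one Jacobi sweep for u'' = 0 with fixed end values."""
    return [u[0]] + [0.5 * (u[i - 1] + u[i + 1]) for i in range(1, len(u) - 1)] + [u[-1]]

def partition(u: List[float], nprocs: int) -> Tuple[List[List[float]], List[int]]:
    """Split u into nprocs contiguous blocks (needs len(u) >= nprocs), each padded
    with one ghost cell per side; also return each block's global offset."""
    n = len(u)
    bounds = [n * p // nprocs for p in range(nprocs + 1)]
    blocks = [[0.0] + u[bounds[p]:bounds[p + 1]] + [0.0] for p in range(nprocs)]
    return blocks, bounds[:-1]

def overlap_update(blocks: List[List[float]]) -> None:
    """Fill ghost cells from neighboring blocks (stands in for send/receive pairs)."""
    for p in range(len(blocks) - 1):
        left, right = blocks[p], blocks[p + 1]
        left[-1] = right[1]   # first owned value of the right neighbor
        right[0] = left[-2]   # last owned value of the left neighbor

def parallel_jacobi(blocks: List[List[float]], offsets: List[int], n: int) -> List[List[float]]:
    """Same stencil as serial_jacobi, applied block by block after an overlap update."""
    overlap_update(blocks)
    new_blocks = []
    for b, off in zip(blocks, offsets):
        nb = list(b)
        for i in range(1, len(b) - 1):
            g = off + i - 1            # global index of local point i
            if 0 < g < n - 1:          # physical boundary values stay fixed
                nb[i] = 0.5 * (b[i - 1] + b[i + 1])
        new_blocks.append(nb)
    return new_blocks

def gather(blocks: List[List[float]]) -> List[float]:
    """Concatenate the owned (non-ghost) values of all blocks."""
    return [x for b in blocks for x in b[1:-1]]
```

Consider `u = [float(i * i) for i in range(10)]` split over three blocks. Then `gather(parallel_jacobi(*partition(u, 3), len(u)))` equals `serial_jacobi(u)`. The stencil code inside the block loop is the same as in the serial version; only the ghost-cell refresh is added.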

Summary of P++ Features: 

• Object oriented indexing of the array objects simplifies development of serial codes by 
removing error prone explicit indexing common to for or do loops. 

• Algorithm and code development takes place in a serial environment. Serial codes are 
re-compilable to run in parallel without modification. 

• P++ codes are portable between different architectures. Vectorization, parallelization and data 
partitioning are hidden from the user, except for optimization switches. 

• P++ application codes communicate as efficiently as codes with explicit message passing. With improved C++ compilers and an optimized implementation of M++, single node performance of C++ with array classes has the potential to approximate that of Fortran.
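Object-oriented indexing, as in M++/P++ and Fortran 90 array sections, replaces explicit loops with whole-range expressions. The following minimal Python sketch assumes two hypothetical classes. `Index` holds a start and a count and can be shifted by an integer. `Array` is a tiny array type. Neither is the M++ or P++ API. They only illustrate how a three-point stencil can be written without loop indices. Because the whole expression is visible to the array class, a parallel implementation can insert any required communication behind it.

```python
from typing import List

class Index:
    """Hypothetical index range [start, start + count) that can be shifted."""
    def __init__(self, start: int, count: int):
        self.start, self.count = start, count
    def __add__(self, k: int) -> "Index":
        return Index(self.start + k, self.count)
    def __sub__(self, k: int) -> "Index":
        return Index(self.start - k, self.count)

class Array:
    """Minimal 1D array supporting whole-range expressions instead of explicit loops."""
    def __init__(self, data: List[float]):
        self.data = list(data)
    def __getitem__(self, I: Index) -> "Array":
        return Array(self.data[I.start:I.start + I.count])
    def __setitem__(self, I: Index, rhs: "Array") -> None:
        assert len(rhs.data) == I.count, "conformable shapes required"
        self.data[I.start:I.start + I.count] = rhs.data
    def __add__(self, other: "Array") -> "Array":
        return Array([a + b for a, b in zip(self.data, other.data)])
    def __rmul__(self, s: float) -> "Array":
        return Array([s * a for a in self.data])
```

For example, with `u = Array([0.0, 1.0, 4.0, 9.0, 16.0])`, `new = Array(u.data)`, and `I = Index(1, 3)`, the statement `new[I] = 0.5 * (u[I - 1] + u[I + 1])` updates the three interior points without a `for` loop.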
Current State, Performance Issues and Related Work: The P++ prototype is currently implemented on the basis of the AT&T C++ C-Front precompiler using the Intel communication library (or, on an experimental basis, a library from Caltech). Current versions are running on the Intel iPSC/860 Hypercube, the Intel Simulator, SUN workstations, the Cray 2, and IBM PCs. The prototype contains all major concepts described above. At several points, without loss of generality, its functionality is restricted to the needs of our own set of test problems (3D multigrid codes and FAC/AFAC codes).

The feasibility of the approach has been proven by the successful implementation and use of our set of test problems on the basis of P++, in particular the very complex AMR++ class library. The results obtained for parallel efficiency, whose optimization was one of the major goals of the P++ development, are also very satisfying. Comparisons between P++-based test codes and Fortran test codes with explicit message passing have shown that the number of messages and the amount of communicated data are roughly the same. Thus, apart from a negligible overhead, similar parallel efficiency can be achieved.

With respect to single node performance, only little optimization has been done. The major reason is that the system software components used (AT&T C++ C-Front precompiler 2.1, M++) are not very well optimized for the target machines. However, our experiences with C++ array language class libraries on workstations and on the Cray Y-MP are very promising. In collaboration with Sandia National Laboratories, about 90% of the Fortran vector performance was achieved on the Cray Y-MP. With new optimized system software versions, Fortran performance can be approximated. Therefore, altogether we expect the parallel performance of P++-based codes to be similar to that obtained for optimized Fortran codes with explicit message passing.

AMR++, AN ADAPTIVE MESH REFINEMENT CLASS LIBRARY 

AMR++ is a C++ class library that simplifies the details of building self-adaptive mesh refinement applications. The use of this class library significantly simplifies the construction of local refinement codes for both serial and parallel architectures. AMR++ has been developed in a serial environment using C++ and the M++ array class interface. It runs in a parallel environment because M++ and P++ share the same array interface. The nested set of abstractions provided by AMR++ uses P++ at its lowest level to provide architecture-independent support. Therefore, AMR++ inherits the machine targets of P++ and thus has a broad base of machines on which to run. The efficiency and performance of AMR++ depend mostly on those of the M++ and P++ array class libraries in the serial and parallel environments, respectively. In this way, the P++ and AMR++ class libraries separate the abstractions of local refinement and parallelism. This significantly eases the development of parallel adaptive mesh refinement applications in an architecture-independent manner. The AMR++ class library represents work that combines complex numerical, computer science, and engineering application requirements. Therefore, the work naturally involves compromises in its initial development. In the following sections, the features and current restrictions of the AMR++ class library are summarized.

Block Structured Grids Features and Restrictions: The target grid types of AMR++ are 2D and 3D block structured grids with rectangular or logically rectangular blocks. On the one hand, they allow for a very good representation of complex internal geometries introduced through local refinement in regions with increased local activity. This flexibility of local refinement block structured grids equally applies to global block structured grids that allow for matching complex external geometries. On the other hand, the restriction to structures of rectangular blocks, as opposed to fully unstructured grids, allows for the application of the VSG programming model of P++. It is therefore the foundation for good efficiency and performance in distributed environments, which is one of the major goals of the P++/AMR++ development. Thus, we believe that block structured grids are the best compromise between full generality of the grid structure and efficiency in a distributed parallel environment. The application class forms a broad cross section of important scientific applications.




In the following, the global grid is the finest uniformly discretized grid that covers the whole physical domain. Local refinement grids are formed from the global grid, or recursively from refinement grids, by standard refinement with $h_{\text{fine}} = \frac{1}{2} h_{\text{coarse}}$ in each coordinate direction. Thus, boundary lines of block structured refinement grids always match grid lines on the underlying discretization level. The construction of block structured grids in AMR++ has some practical limitations that simplify the design and use of the class libraries. Specifically, grid blocks at the same level of discretization cannot overlap. Block structures are formed by distinct or connected rectangular blocks that share their boundary points (block interfaces) at those places where they adjoin each other. Thus, a connected region of blocks forms a block structured refinement grid. One refinement level may consist of more than one disjoint block structured refinement grid. In the dynamic adaptive refinement procedure, refinement grids can be automatically merged if they adjoin each other.
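The refinement rule and the grouping of adjoining blocks into block structured refinement grids can be sketched in a few lines of Python. The sketch assumes vertex-centered 2D blocks stored as inclusive integer boxes and a refinement ratio of 2. `refine`, `touches`, and `connected_grids` are hypothetical helpers. Using union-find to find connected groups is our choice, not necessarily what AMR++ does. Because coarse point i maps to fine point 2i, refined block boundaries fall on coarse grid lines. Two blocks belong to the same refinement grid when they share at least one boundary point.

```python
from typing import Dict, List, Tuple

Box = Tuple[Tuple[int, int], Tuple[int, int]]  # ((ilo, jlo), (ihi, jhi)), inclusive

def refine(box: Box, ratio: int = 2) -> Box:
    """Map a box to fine index space: coarse point i becomes fine point ratio * i,
    so the refined block's boundary lines coincide with coarse grid lines."""
    (ilo, jlo), (ihi, jhi) = box
    return (ilo * ratio, jlo * ratio), (ihi * ratio, jhi * ratio)

def touches(a: Box, b: Box) -> bool:
    """True if the closed boxes share at least one grid point (a block interface)."""
    (ailo, ajlo), (aihi, ajhi) = a
    (bilo, bjlo), (bihi, bjhi) = b
    return ailo <= bihi and bilo <= aihi and ajlo <= bjhi and bjlo <= ajhi

def connected_grids(blocks: List[Box]) -> List[List[int]]:
    """Group the blocks of one level into connected block structured grids."""
    parent = list(range(len(blocks)))

    def find(k: int) -> int:
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k

    for a in range(len(blocks)):
        for b in range(a + 1, len(blocks)):
            if touches(blocks[a], blocks[b]):
                parent[find(a)] = find(b)
    groups: Dict[int, List[int]] = {}
    for k in range(len(blocks)):
        groups.setdefault(find(k), []).append(k)
    return sorted(groups.values())
```

For example, two blocks sharing the line i = 4 form one refinement grid, and a distant third block forms a second grid on the same level.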



Figure 1: Example of a composite grid, its composite grid tree, and a cut-out of 2 blocks with their extended boundaries and interface. (a) 3-level composite grid; (b) adjoining grid blocks; (c) composite grid tree.


Figure 1 (a) shows an example of a composite grid: a rectangular domain within which we center a curved front and a corner singularity. The grid blocks are numbered lexicographically: the first digit represents the level, the second digit the connected block structured refinement grid, and the third digit the grid block. Such problems could represent the structure of shock fronts or multi-fluid interfaces in fluid flow applications. In oil reservoir simulations, for example, the front could be an oil-water front moving with time, and the corner singularity could be a production well. In this specific example, the front is refined with two block structured refinement grids. The first grid, on refinement level 2, is represented by grid blocks 2.1.1 and 2.1.2. The second grid, on level 3, is represented by grid blocks 3.1.1, 3.1.2 and 3.1.3. In the corner, a single refinement block is introduced on each of the levels.

For ease of implementation, in the AMR++ prototype the global grid must be uniform. This 
simplification of the global geometry was necessary in order to be able to concentrate on the major 
issues of this work, namely, the implementation of local refinement and self adaptivity in an 
object-oriented environment. This restriction is not critical and can be eased in future versions of 
the prototype. Aside from implementation issues, some additional functionality must be made 
available: 
• For implicit solvers, the resulting domain decomposition of the global grid may require special capabilities within the single grid solvers (e.g., multigrid solvers for block structured grids with adequate smoothers, such as inter-block line or plane relaxation methods).

• The block structures in the current AMR++ prototype are defined only by the needs of local refinement of a uniform global grid. This restriction allows them to be Cartesian. More complicated structures, such as those resulting from difficult non-Cartesian external geometries (e.g., holes; see [11]), are currently not taken into consideration. An extension of AMR++, however, is possible in principle. The extensive experience with general 2D block structured grids that has been gained at GMD [11] can form a basis for these extensions. Whereas our work is comparatively simple in 2D, because no explicit communication is required, extending the GMD work to 3D problems is very complex.

Some Implementation Issues: In the following, some implementation issues are detailed. They 
also demonstrate the complexity of a proper and efficient treatment of block structured grids and 
adaptive refinement. AMR++ takes care of all of these issues, which would otherwise have to be 
handled explicitly at the application level. 

• Dimensional independence and multi-indexing: The implementation of most features of AMR++ and its user interface is dimensionally independent. Because it was derived from user requirements, the AMR++ prototype is restricted at the lowest level to 2D and 3D applications. This, however, is a restriction that can easily be removed.

One important means of reaching dimensional independence is multi-dimensional indices (multi-indices), which contain one index for each coordinate direction. On top of these multi-indices, index variants are defined for each type of sub-block (interior, interior and boundary, boundary only, ...), and each variant contains multiple multi-indices. For example, addressing the boundary of a 3D block (non-convex) requires one multi-index for each of the six planes. In order to avoid special treatment of physical boundaries, all index variants are defined twice: once including and once excluding the physical boundary. All index variants, several of them also including extended boundaries (see below), are precomputed at the time when a grid block is allocated. In the AMR++ user interface and in the top level classes, only index variants or indicators are used. This allows a dimensionally independent formulation, except for very low level implementations.

• Implementation of block structured grids: The AMR++ grid block objects consist of the interior, the boundary, and an extended boundary of a grid block, together with links that are formed between adjacent pairs of grid block objects. The links contain P++ array objects that hold no actual data. Instead, they serve as views (subarrays) of the overlapping parts of the extended boundary between adjacent grid block objects. The actual boundaries that are shared between different blocks (block interfaces) are very complex structures that are represented properly in the grid block objects. For example, in 3D, interfaces between blocks are 2D planes, those between plane interfaces are 1D line interfaces, and, further, those between line interfaces are points (zero-dimensional).

In Figure 1 (b), grid blocks 2.1.1 and 2.1.2 of the composite grid in Figure 1 (a) are depicted 
including their block interface and their extended boundary. The regular lines denote the 
outermost line of grid points of each block. Thus, with an extended boundary of two, there is 
one line of points between the block boundary line and the dashed line for the extended 
boundary. In its extended boundary, each grid block has views of the values of the original grid 
points of its adjoining neighboring block. This way it is possible to evaluate stencils on the 
interface and, with an extended boundary width of two, to also define a coarse level of the 
block structured refinement grid in the multigrid sense.

• Data structures and iterators: In AMR++, the composite grid is stored as a tree of all refinement grids, with the global grid being the root. Block structured grids are stored as lists of blocks (for ease of implementation; collections of blocks would be sufficient in most cases). Figure 1 (c) shows the composite grid tree for the example composite grid in Figure 1 (a).

The user interface for operations on these data structures consists of so-called iterators. For an operation on the composite grid (e.g., zeroing each level or interpolating a grid function to a finer level), an iterator is called that traverses the tree in the correct order (preorder, postorder, no order). This iterator takes as arguments the function to be executed and two indicators that specify the physical boundary treatment and the type of sub-grid to be treated. The iteration starts at the root and recursively traverses the tree. For an operation (e.g. Jacobi relaxation) on a block structured grid, iterators are available that process the list of blocks and all block interface lists. They take arguments similar to those of the composite grid tree iterators.
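Multi-indices and index variants can be sketched in a dimension-independent way. The following Python sketch is a hypothetical illustration, not the AMR++ classes. A multi-index is a tuple of inclusive (lo, hi) ranges, one per coordinate direction. The boundary variant of a block is a list of multi-indices, one per face (six for a 3D block), and faces marked as physical boundaries can optionally be excluded. The same code handles 2D and 3D blocks without special cases.

```python
from itertools import product
from typing import FrozenSet, List, Set, Tuple

Range = Tuple[int, int]          # inclusive (lo, hi) in one coordinate direction
MultiIndex = Tuple[Range, ...]   # one range per coordinate direction

def interior(block: MultiIndex) -> MultiIndex:
    """Interior variant: drop one layer of points on every side."""
    return tuple((lo + 1, hi - 1) for lo, hi in block)

def boundary(block: MultiIndex,
             physical: FrozenSet[Tuple[int, int]] = frozenset(),
             include_physical: bool = True) -> List[MultiIndex]:
    """Boundary variant: one multi-index per face (2 * dim faces in total).

    A face is identified by (direction d, side s), with s = 0 (low) or 1 (high).
    Faces listed in `physical` are skipped when include_physical is False.
    """
    faces = []
    for d, (lo, hi) in enumerate(block):
        for s, fixed in enumerate((lo, hi)):
            if not include_physical and (d, s) in physical:
                continue
            # Fix coordinate d at the face, keep the full range in all other directions.
            faces.append(block[:d] + ((fixed, fixed),) + block[d + 1:])
    return faces

def points(mi: MultiIndex) -> Set[Tuple[int, ...]]:
    """Enumerate the grid points addressed by a multi-index."""
    return set(product(*(range(lo, hi + 1) for lo, hi in mi)))
```

For a 5 × 5 × 5 block, the six face multi-indices together address the 98 points that are not in the 3 × 3 × 3 interior. Passing a physical face with `include_physical=False` gives the second, "excluding" variant described above.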

Object-Oriented Design and User Interface: The AMR++ class libraries are customizable by using the object-oriented features of C++. For example, to obtain efficiency in the parallel environment, it may be necessary to introduce alternate iterators that traverse the composite grid tree or the blocks of a refinement region in a special order. This is implemented by using different base classes in the serial and parallel environments. The same is true for alternate composite grid cycling strategies, as needed, for example, in AFAC in contrast to FAC algorithms (Section 4). Application-specific parts of AMR++, such as the single grid solvers or criteria for adaptivity, have to be supplied by the user. They are also specified simply by substituting alternate base classes: a pre-existing application (e.g., problem setup and uniform grid solver) uses AMR++ to extend its functionality and to build an adaptive mesh refinement application. Thus, the user supplies a solver class and some additional required functionality (refinement criteria, ...). The user then relies on the highest level AMR++ ((Self_)Adaptive_)Composite_Grid class to formulate a special algorithm or to use one of the supplied PDE solvers. In the current prototype of AMR++, FAC and AFAC based solvers (Section 4) are supplied. If the single grid application is written using P++, then the resulting adaptive mesh refinement application is architecture independent and so can be run efficiently in a parallel environment.
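A composite grid tree iterator that takes the operation as an argument might look like the following Python sketch. `GridNode` and `iterate` are hypothetical names. The sketch omits the boundary-treatment and sub-grid indicators that the AMR++ iterators also take. Preorder visits coarser grids before finer ones, as when interpolating to a finer level. Postorder visits finer grids first, as when transferring fine-level information to coarser levels. Passing a different traversal function plays roughly the role that substituting an alternate iterator base class plays in AMR++.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class GridNode:
    """Hypothetical node of a composite grid tree: one refinement grid and its children."""
    name: str
    children: List["GridNode"] = field(default_factory=list)

def iterate(node: GridNode, fn: Callable[[GridNode], None], order: str = "preorder") -> None:
    """Apply fn to every grid in the tree rooted at node.

    'preorder' applies fn to a parent before its children (coarse to fine);
    'postorder' applies fn to a parent after its children (fine to coarse).
    """
    if order == "preorder":
        fn(node)
    for child in node.children:
        iterate(child, fn, order)
    if order == "postorder":
        fn(node)
```

For a root G with children A and B, where A has a child C, preorder visits G, A, C, B and postorder visits C, A, B, G.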

The design and interface of AMR++ are object-oriented, and the implementation of our prototype extensively uses features like encapsulation and inheritance. The abstraction of self-adaptive local refinement involves the handling of many issues, including memory management, an interface for application-specific control, dynamic adaptivity, and efficiency. It is achieved by grouping these different functionalities in several interconnected classes. For example, memory management is greatly simplified by the object-oriented organization of the AMR++ library: issues such as the lifetime of variables are handled automatically by the scoping rules of C++, so memory management is automatic and predictable. Also, the control over construction of the composite grid is intuitive and natural: creating composite grid objects is similar to declaring floating point or integer variables in procedural languages like Fortran and C. The user basically formulates a solver in one of two ways. The first is to allocate one of the predefined composite grid solver objects. The second is to build the solver from the composite grid objects and associated iterators while supplying the single grid solver class.

Although it is not part of the current implementation of AMR++, C++ introduces a template mechanism in the latest standardization of the language, which is only just beginning to be part of commercial products. The general purpose of this template language feature is to permit class libraries to access user-specified base types. For AMR++, for example, the template feature could be used to allow the specification of the base solver and adaptive criteria for the parallel adaptive local refinement implementation. In this way, constructing an adaptive local refinement code from the single grid application with the AMR++ class library can become even simpler and cleaner. The object-oriented design of interconnected classes will not be discussed further; the reader is referred instead to [10] and [7].

Static and Dynamic Adaptivity, Grid Generation: In the current AMR++ prototype, static adaptivity is fully implemented. The user can specify a composite grid either interactively or by some input file. For each grid block, AMR++ needs its global coordinates and the parent grid block. Block structured local refinement regions are formed automatically by investigating neighboring relationships. In addition, the functionality for adding and deleting grid blocks under user control is available within the Adaptive_Composite_Grid object of AMR++.

Recently, dynamic adaptivity has been a subject of intensive research. Initial results are very promising, and some basic functionality has been included in the AMR++ prototype. Given a global grid, a flagging criterion function, and some stopping criteria, the Self_Adaptive_Composite_Grid object iteratively solves on the current composite grid and generates a new discretization level on top of the respective finest level. Building a new composite grid level works as follows:

1. The flagging criterion delivers an unstructured collection of flagged points in each grid block. For representing grid block boundaries, all neighboring points of flagged points are also flagged.

2. The new set of grid blocks that contribute to the refinement level (gridding) is built by applying a smart recursive bisection algorithm similar to the one developed in [2]. If building a rectangle around all flagged points of the given grid block is too inefficient, the rectangle is bisected in the longer coordinate direction and new enclosing rectangles are computed. The efficiency of each resulting rectangle is measured by the ratio of flagged points to all points of the new grid block; the tests that follow use a threshold of 75%. This procedure is repeated recursively if any of the new rectangles is also inefficient. The goal is to build rectangles that are as large as possible within the given efficiency constraint. Splitting in halves is too inefficient for this because it results in very many small rectangles, so the bisection point is chosen by a combination of signatures and edge detection. A detailed description of this method is beyond the scope of this paper; the reader is referred to [2] or [7].

3. Finally, the new grid blocks are added to the composite grid to form the new refinement level. These blocks are grouped into connected block structured grids in the same way as in the static case.
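The flagging and gridding steps above can be approximated by the following Python sketch. It is a simplified stand-in for the method of [2]:

- Flagged points are buffered by one neighbor in every direction.
- A box whose efficiency (flagged points divided by all points) is below 75% is split in its longer direction.
- The split is placed at a zero of the signature (a line without flagged points) if one exists, and otherwise at the midpoint.

The edge-detection part of the original method is omitted. The midpoint fallback is exactly the halving that the text calls inefficient, so this sketch will generally produce more and smaller boxes than AMR++ does. All names are hypothetical.

```python
from typing import List, Set, Tuple

Point = Tuple[int, int]
Box = Tuple[int, int, int, int]  # (ilo, jlo, ihi, jhi), inclusive

def add_buffer(flagged: Set[Point]) -> Set[Point]:
    """Step 1: also flag all neighbors of flagged points."""
    return {(i + di, j + dj) for i, j in flagged for di in (-1, 0, 1) for dj in (-1, 0, 1)}

def bounding_box(pts: Set[Point]) -> Box:
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return min(xs), min(ys), max(xs), max(ys)

def cluster(pts: Set[Point], min_eff: float = 0.75) -> List[Box]:
    """Step 2 (simplified): recursive bisection until every box is min_eff full."""
    ilo, jlo, ihi, jhi = box = bounding_box(pts)
    area = (ihi - ilo + 1) * (jhi - jlo + 1)
    if len(pts) / area >= min_eff:
        return [box]
    d = 0 if ihi - ilo >= jhi - jlo else 1           # longer coordinate direction
    lo, hi = (ilo, ihi) if d == 0 else (jlo, jhi)
    sig = {c: 0 for c in range(lo, hi + 1)}           # signature: flagged count per line
    for p in pts:
        sig[p[d]] += 1
    holes = [c for c in range(lo + 1, hi) if sig[c] == 0]
    # Cut at a middle hole if there is one, otherwise at the midpoint; both halves
    # are non-empty because the bounding box has flagged points at lo and at hi.
    cut = holes[len(holes) // 2] if holes else lo + (hi - lo + 1) // 2
    left = {p for p in pts if p[d] < cut}
    right = {p for p in pts if p[d] >= cut}
    return cluster(left, min_eff) + cluster(right, min_eff)
```

Consider two separate 4 × 4 clusters of flagged points. The signature gap between them is detected, and the sketch returns exactly two fully efficient boxes instead of one large, mostly empty box.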

This flagging and gridding algorithm has the potential for further optimization: The bisection 
method can be further improved, and a clustering and merging algorithm could be applied. This is 
especially true for refinement blocks of different parent blocks that could form one single block with 
more than one parent. Internal to AMR++, this kind of parent / child relationship is supported. 
The results in Section 5, however, show that the gridding already is quite good. The number of 
blocks that are constructed automatically is only slightly larger (< 10%) than a manual 
construction would deliver. A next step in self-adaptive refinement would be to support time 
dependent problems whose composite grid structure changes dynamically with time (e.g., moving 
fronts). In this case, in addition to adding and deleting blocks, enlarging and diminishing blocks 
must be supported. Though some basic functionality and the implementation of the general concept 
is already available, this problem has not yet been further pursued. 

Current State and Related Work: The AMR++ prototype is implemented using M++ and the AT&T Standard Components class library, which provides standardized classes (e.g., linked list classes). Through the shared interface of M++ and P++, AMR++ inherits all target architectures of P++. The prototype has been successfully tested on SUN workstations and on the Intel iPSC/860, where it has proved its full functionality with respect to parallelization. Given the large application class of AMR++, there are still several shortcomings and restrictions, as well as a large potential for optimization. For parallel environments, for example, efficiently implementing self-adaptivity, including load (re)balancing, requires further research. In addition, the iterators currently available in AMR++ work in a parallel environment but are best suited for serial environments. Special parallel iterators would have to be provided, for example to support functional parallelism on the internal AMR++ level. Until now, AMR++ has been successfully used as a research tool for the algorithms and model problems described in the next two sections.
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However, AMR++ provides the functionality to implement much more complicated application 
problems. 

Concerning parallelization, running AMR++ under P++ on the Intel iPSC/860 has proven its
full functionality. Intensive optimization, however, has only been done within P++; AMR++ itself
offers a large potential for optimization.

To the authors’ knowledge, the AMR++ approach is unique. There are several other 
developments in this area (e.g. [11]), but they either address a more restricted class of problems or 
are restricted to serial environments. 


MULTILEVEL ALGORITHMS WITH ADAPTIVE MESH REFINEMENT 


The fast adaptive composite grid method (FAC, [12]), which was originally developed from and
is very similar to the Multi-Level Adaptive Technique (MLAT, [3]), is an algorithm that uses
uniform grids, both global and local, to solve partial differential equations. This method is known
to be highly efficient on scalar or single processor vector computers, due to its effective use of
uniform grids and multiple levels of resolution of the solution. On distributed memory
multiprocessors, methods like MLAT or FAC benefit from their tendency to create multiple isolated
refinement regions, which may be effectively treated in parallel. However, for several problem
classes, they suffer from the way in which the levels of refinement are treated sequentially in each
region. Specifically, the finer levels must wait to be processed until the coarse-level approximations
have been computed and passed to them; conversely, the coarser levels must wait until the finer
level approximations have been computed and used to correct their equations. Thus, the
parallelization potential of these "hierarchical" methods is restricted to intra-level parallelization.


The asynchronous fast adaptive composite method (AFAC) eliminates this bottleneck of
parallelism. Through a simple mechanism used to reduce inter-level dependencies, individual
refinement levels can be processed by AFAC in parallel. As a consequence, the convergence factor of
AFAC is the square root of that of FAC. Therefore, since AFAC and FAC require roughly the
same number of floating point operations per cycle, AFAC requires about twice the serial computational
time of FAC, but it allows for the introduction of inter-level parallelization.


As opposed to the original development of FAC and AFAC, this paper discusses and uses the modified
algorithms known as FACx and AFACx. They differ in the treatment of the refinement levels.
Whereas FAC and AFAC compute a rather accurate solution on the refinement levels (e.g., with one
MG V-cycle), FACx uses only a couple of relaxations. AFACx uses a two-grid procedure (of
FMG type) on the refinement level and its standard coarsening, with several relaxations on each of
these levels. Experiments and some theoretical observations show that all of the results that have
been obtained for FAC and AFAC also hold for FACx and AFACx (see [14]). In the following, FAC
and AFAC always denote the modified versions (FACx and AFACx).


Numerical algorithms: Both FAC (MLAT) and AFAC consist of two basic steps, which are 
described loosely as follows: 

1. Smoothing phase: Given the solution approximation and composite grid residuals on each level, 
use relaxation or some restricted multigrid procedure to compute a correction local to that level 
(a better approximation is required on the global grid, the finest uniform discretization level). 

2. Level transition phase: Combine the local corrections with the global solution approximation,
compute the global composite grid residual, and transfer the local components of the
approximation and residual to each level.

The difference between MLAT and FAC on the one hand and AFAC on the other hand is in the 
order in which the levels are processed and in the details of how they are combined: 




• FAC and MLAT can roughly be viewed as standard multigrid methods with mesh refinement 
and a special treatment of the interfaces between the refinement levels and the underlying 
coarse level. In FAC and MLAT the treatment of the refinement levels is hierarchical. Theory 
on FAC is based on its interpretation as a multiplicative Schwarz Alternating Method or as a 
block relaxation method of Gauss-Seidel type. 

FAC and MLAT mainly differ in their motivation. Whereas it is the goal of FAC to compute a
solution for the composite grid (grid points of the composite grid are all the interior points of 
the respective finest discretization level), the major goal of MLAT is to get the best possible 
solution on a given uniform grid (by using local refinement). Thus, in FAC, coarse levels of
the composite grid serve for the computation of corrections. Therefore, FAC was originally 
formulated as a correction scheme (CS). The MLAT formulation requires a full approximation 
scheme (FAS), because coarse levels serve as correction levels for the points covered by finer 
levels. MLAT was first developed using finite difference discretization, whereas for FAC finite 
volume discretizations were used. However, they are closely related and in many problems lead 
to the same stencil representation. This is true except perhaps for the interface points, where 
finite volume discretizations generally lead to conservative discretizations (FAC), whereas finite 
difference discretizations do not (MLAT). Instead, in MLAT, usually a higher order 
interpolation is used on the interface. Other than this exception, because of the modification of 
the original FAC algorithm as discussed above, there is no difference in the treatment of the 
refinement levels between the original MLAT algorithm and the modified FAC algorithm that 
is discussed in this paper. It can be shown ([7]) that an FAS version of FAC with a special 
choice of the operators on the interface is equivalent to the originally developed Multilevel 
Adaptive Technique (MLAT). 

• AFAC on the other hand consists of the same discretization and operators as FAC, but a 
decoupled and asynchronous treatment of the refinement levels in the solution phase, which 
dominates the arithmetic work in the algorithm. Theory on AFAC can be based on its 
interpretation as an additive Schwarz Alternating Method or as a block relaxation method of 
Jacobi type. 
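The multiplicative/additive distinction above can be seen in a toy setting. The following sketch is an illustration only and does not implement FAC or AFAC. It compares two-block Gauss-Seidel, which is multiplicative (the second block sees the first block's update), with two-block Jacobi, which is additive (both blocks use the same residual), on a 1D Laplacian. A matrix split into two blocks is block tridiagonal, and therefore consistently ordered. Classical theory for such matrices gives $\rho_{GS} = \rho_{J}^2$. This mirrors the square-root relation between the AFAC and FAC convergence factors stated in this section. The helper names are hypothetical.

```python
import numpy as np

def laplacian_1d(n):
    """Tridiagonal matrix tridiag(-1, 2, -1) of size n."""
    return 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def block_iteration_matrices(A, split):
    """Iteration matrices of two-block Jacobi (additive) and Gauss-Seidel (multiplicative)."""
    n = A.shape[0]
    D = np.zeros_like(A)                  # block diagonal part
    D[:split, :split] = A[:split, :split]
    D[split:, split:] = A[split:, split:]
    L = np.zeros_like(A)                  # strictly block-lower part
    L[split:, :split] = A[split:, :split]
    I = np.eye(n)
    M_jac = I - np.linalg.solve(D, A)     # both blocks corrected from the same residual
    M_gs = I - np.linalg.solve(D + L, A)  # block 2 corrected after block 1 has been updated
    return M_jac, M_gs

def spectral_radius(M):
    return max(abs(np.linalg.eigvals(M)))

A = laplacian_1d(20)
M_jac, M_gs = block_iteration_matrices(A, 10)
rho_j, rho_gs = spectral_radius(M_jac), spectral_radius(M_gs)
print(rho_j, rho_gs, rho_j**2)   # rho_gs equals rho_j**2 up to rounding
```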

Theory in [12] shows that, under appropriate conditions, the convergence factors of FAC and
AFAC satisfy $\rho_{AFAC} = \sqrt{\rho_{FAC}}$. This implies that two cycles of AFAC are roughly
equivalent to one cycle of FAC. If the algorithmic components are chosen slightly differently from those
in the convergence analysis, or if the methods are applied to singular perturbation problems as discussed
in the next section, experience shows that AFAC usually performs better than the above formula suggests:
in several cases, the convergence factor of AFAC shows only a slight degradation of the FAC rate
(Section 5).
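The statement that two AFAC cycles roughly equal one FAC cycle follows from taking logarithms. If $\rho_{AFAC} = \sqrt{\rho_{FAC}}$, then $\log \rho_{AFAC} = \tfrac{1}{2}\log \rho_{FAC}$. As a result, AFAC needs twice as many cycles to reach the same error reduction. This small sketch only evaluates that arithmetic. The target reduction of $10^{-6}$ is an assumed example value.

```python
import math

def cycles_needed(rho, reduction=1e-6):
    """Cycles with convergence factor rho needed to reduce the error by `reduction`."""
    return math.ceil(math.log(reduction) / math.log(rho))

def afac_factor(rho_fac):
    """AFAC factor predicted by rho_AFAC = sqrt(rho_FAC) (valid only under the theory's conditions)."""
    return math.sqrt(rho_fac)

rho_fac = 0.17
print(cycles_needed(rho_fac), cycles_needed(afac_factor(rho_fac)))   # 8 16
```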

Parallelization, an Example for the Use of P++/AMR++: Using an example, we demonstrate
some of the features of AMR++ and how P++ supports the design of parallel block structured
local refinement applications based on the FAC and AFAC algorithms.

In a parallel environment, partitioning the composite grid levels becomes a central issue in the 
performance of composite grid solvers. In Figure 2, two different partitioning strategies that are 
supported within P++/AMR++ are illustrated for the composite grid in Figure 2. For ease of 
illustration, grid blocks 2.2 and 2.3 are not included. The so-called FAC partitioning in Figure 2 (b) 
is typical for implicit and explicit algorithms, where the local refinement levels have to be treated in 
a hierarchical manner (FAC, MLAT,...). The so-called AFAC partitioning in Figure 2 (a) can be 
optimal for implicit algorithms that allow an independent and asynchronous treatment of the 
refinement levels. In the case of AFAC, however, it must be taken into consideration that this 
partitioning is only optimal for the solution phase, which dominates the arithmetic work of the 
algorithm. The efficiency of the level transition phase, which is based on the same hierarchical 
structure as FAC and which may dominate the aggregate communication work of the
algorithm, highly depends on the architecture and the application (communication / computation 
ratio, single node (vector) performance, message latency, transfer rate, congestion, ...). For 
Figure 2: Parallel multilevel local refinement algorithms on block structured grids, an example for
the use of AMR++ and the hidden interaction of the P++ communication models.

determining whether AFAC is better than FAC in a parallel environment, the aggregate efficiency 
and performance of both phases and the relation of the convergence rates must be properly 
evaluated. For more detail, see [10] and [7]. Both types of partitioning are supported in the 
P++/AMR++ environment. 
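The two partitioning strategies can be sketched as processor assignments, under the following assumption. In FAC partitioning, every level is spread over all processors, because the levels are processed one after another. In AFAC partitioning, each level receives its own disjoint processor group, so that the levels can be relaxed concurrently. The function names and the rule that sizes groups in proportion to point counts are hypothetical. They are not the P++/AMR++ interface.

```python
def fac_partition(level_sizes, n_procs):
    """Assumed FAC-style partitioning: every level is distributed over all processors."""
    return {level: list(range(n_procs)) for level in level_sizes}

def afac_partition(level_sizes, n_procs):
    """Assumed AFAC-style partitioning: one disjoint processor group per level,
    sized roughly in proportion to the number of points on that level."""
    levels = list(level_sizes)
    if n_procs < len(levels):
        raise ValueError("need at least one processor per level")
    total = sum(level_sizes.values())
    spare = n_procs - len(levels)
    shares = {lvl: spare * level_sizes[lvl] / total for lvl in levels}
    counts = {lvl: 1 + int(shares[lvl]) for lvl in levels}
    # Hand out the remaining processors by largest fractional share.
    leftover = n_procs - sum(counts.values())
    by_remainder = sorted(levels, key=lambda l: shares[l] - int(shares[l]), reverse=True)
    for lvl in by_remainder[:leftover]:
        counts[lvl] += 1
    groups, start = {}, 0
    for lvl in levels:
        groups[lvl] = list(range(start, start + counts[lvl]))
        start += counts[lvl]
    return groups

sizes = {"global": 1024, "level 1": 1024, "level 2": 2048}
print(afac_partition(sizes, 8))
# {'global': [0, 1], 'level 1': [2, 3], 'level 2': [4, 5, 6, 7]}
```

Under FAC partitioning, a level transition mostly moves data between grids that live on the same processors. Under AFAC partitioning, it moves data between processor groups. This is one reason why the level transition phase can become communication-bound.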

Solvers used on the individually partitioned composite grid levels make use of overlap updates 
within P++ array expressions, which automatically provide communication as needed. The 
inter-grid transfers between local refinement levels, typically located on different processors, rely on 
VSG updates. The VSG updates are also provided automatically by the P++ environment. Thus, 
the underlying support of parallelism is isolated in P++ through either overlap update or VSG 
update, or a combination of both, and the details of parallelism are isolated away from the AMR++ 
application. The block structured interface update is handled in AMR++, but the associated
communication is hidden in P++ (mostly in the VSG update).
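The idea behind an overlap (ghost cell) update can be shown serially. The following sketch is a hypothetical illustration, not the P++ implementation. A 1D grid is split into contiguous pieces, and each piece stores one ghost cell per side. Before every Jacobi sweep for $-u'' = f$, the ghost cells are refreshed from the neighbouring pieces. This refresh is the step that P++ performs automatically as communication. With the refresh in place, the partitioned sweep reproduces the global sweep.

```python
import numpy as np

def jacobi_global(u, f, h):
    """One Jacobi sweep for -u'' = f; the Dirichlet end values stay fixed."""
    new = u.copy()
    new[1:-1] = 0.5 * (u[:-2] + u[2:] + h * h * f[1:-1])
    return new

def partition(n, n_parts):
    """Contiguous ownership ranges [lo, hi) for n global points."""
    edges = np.linspace(0, n, n_parts + 1).astype(int)
    return [(int(lo), int(hi)) for lo, hi in zip(edges[:-1], edges[1:])]

def scatter(u, bounds):
    """Local arrays: owned points plus one ghost (overlap) cell on each side."""
    parts = []
    for lo, hi in bounds:
        local = np.zeros(hi - lo + 2)
        local[1:-1] = u[lo:hi]
        parts.append(local)
    return parts

def overlap_update(parts):
    """The 'communication' step: refresh ghost cells from the neighbouring parts."""
    for k in range(len(parts) - 1):
        parts[k][-1] = parts[k + 1][1]      # right ghost of k <- first owned point of k+1
        parts[k + 1][0] = parts[k][-2]      # left ghost of k+1 <- last owned point of k

def jacobi_local(parts, f_parts, bounds, n, h):
    """Jacobi sweep on each part using only local data (owned and ghost cells)."""
    result = []
    for local, f_loc, (lo, hi) in zip(parts, f_parts, bounds):
        new = local.copy()
        for m in range(1, hi - lo + 1):
            g = lo + m - 1                  # global index of local point m
            if 0 < g < n - 1:               # skip the Dirichlet end points
                new[m] = 0.5 * (local[m - 1] + local[m + 1] + h * h * f_loc[m])
        result.append(new)
    return result

def gather(parts):
    return np.concatenate([p[1:-1] for p in parts])

n, h = 17, 1.0 / 16
rng = np.random.default_rng(0)
u0 = np.concatenate([[0.0], rng.random(n - 2), [0.0]])
f = np.ones(n)
bounds = partition(n, 4)
u_glob, parts, f_parts = u0.copy(), scatter(u0, bounds), scatter(f, bounds)
for _ in range(5):
    u_glob = jacobi_global(u_glob, f, h)
    overlap_update(parts)                   # without this, the pieces would use stale ghosts
    parts = jacobi_local(parts, f_parts, bounds, n, h)
print(np.allclose(gather(parts), u_glob))   # True
```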

RESULTS FOR SINGULAR PERTURBATION PROBLEMS 

Use of the tools described above is now demonstrated with initial examples. The adaptivity 
provided by AMR++ is necessary in case of large gradients or singularities in the solution of the 
PDE. They may be due to rapid changes in the right-hand side or coefficients of the PDE, corners 
in the domain, or singular perturbations. Here, the first and the last case will be examined on the 
basis of model problems. 

Singularly perturbed PDEs model physical processes with relatively small
diffusion (viscosity) and dominating convection. They may occur as a single equation or within
systems of complex equations, e.g., as the momentum equations within the Navier-Stokes equations or,
additionally, as supplementary transport equations in the Boussinesq system of equations. Here, we
merely treat a single equation. However, we only use methods that generalize directly to more
complex situations. Therefore, we do not rely on the direct solution methods that downstream or
ILU relaxations provide for simple problems with pure upstream discretization; these relaxations
are not direct solution methods for systems of equations. Further, such flow-direction-dependent
relaxations are not efficiently parallelizable when only a few relaxations are performed, as is
usual in multilevel methods. This holds in particular on massively parallel systems.



Figure 3: Results for a singular perturbation problem: plots of the error and composite grid, with
two different choices of the accuracy η in the self-adaptive refinement process.


Model Problem and Solvers: Numerical results have been obtained for the model problem 

$$-\epsilon \Delta u + a u_x + b u_y = f \quad \text{on } \Omega = (0,1)^2$$

with Dirichlet boundary conditions on $\partial\Omega$ and $\epsilon = 0.00001$. This problem serves as a good model
for complex fluid flow applications, because several of the properties that are related to self-adaptive 
mesh refinement are already present in this simple problem. The equation is discretized using 
isotropic artificial viscosity (diffusion):

$$L_h := -\epsilon_h \Delta_h + a D_{2h,x} + b D_{2h,y}, \quad \text{with } \Delta_h = D^2_{h,x} + D^2_{h,y},$$

$$\epsilon_h := \max\{\epsilon,\ \beta h \max\{|a|,|b|\}/2\}.$$
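The following is a minimal sketch of applying $L_h$ to a grid function. It assumes that $D_{2h,x}$ and $D_{2h,y}$ are central first differences $(u_{i+1,j}-u_{i-1,j})/(2h)$ and that $\Delta_h$ is the standard five-point Laplacian. The helper names `eps_h` and `apply_L` are hypothetical.

```python
import numpy as np

def eps_h(eps, beta, h, a, b):
    """Artificial viscosity eps_h = max{eps, beta*h*max(|a|,|b|)/2}."""
    return max(eps, beta * h * max(abs(a), abs(b)) / 2)

def apply_L(u, h, eps, beta, a, b):
    """Apply L_h to grid values u[i, j] = u(x_i, y_j) (boundary included);
    returns L_h u at the interior points."""
    e = eps_h(eps, beta, h, a, b)
    c = u[1:-1, 1:-1]
    lap = (u[2:, 1:-1] + u[:-2, 1:-1] + u[1:-1, 2:] + u[1:-1, :-2] - 4 * c) / h**2
    dx = (u[2:, 1:-1] - u[:-2, 1:-1]) / (2 * h)   # central difference in x
    dy = (u[1:-1, 2:] - u[1:-1, :-2]) / (2 * h)   # central difference in y
    return -e * lap + a * dx + b * dy

h = 1.0 / 32
x = np.linspace(0.0, 1.0, 33)
X, Y = np.meshgrid(x, x, indexing="ij")
print(eps_h(1e-5, 3.0, h, 1.0, 1.0))   # 0.046875, i.e. 3h/2 >> eps
```

With β = 3 and a = b = 1, the artificial viscosity ε_h = 3h/2 dominates the physical ε = 10^-5 on every grid used here.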


The discrete system is solved by multilevel methods - MG on the finest global grid and FAC or 
AFAC on composite grids with refinement. For the multigrid method, it is known that, with 
artificial viscosity, the two-grid convergence factor (spectral radius of the corresponding iteration 
matrix) is bounded below by 0.5 (for $h \to 0$). Therefore, multilevel convergence factors tend to
1.0 as the number of levels increases. In [5], a multigrid variant that shows surprisingly good
convergence behavior has been developed: MG convergence factors stay far below 0.5 (with three 
relaxations on each level). Here, essentially this method is used, which is described as follows: 

• Discretization with additional isotropic artificial viscosity using $\beta_m = 3$ on the finest grid $m$ and
$\beta_l = \frac{1}{2}(\beta_{l+1} + 1/\beta_{l+1})$ for coarser grids $l = m-1, m-2, \ldots$,

• MG components: odd/even relaxation, non-symmetric transfer operators corresponding to 
linear finite elements. These components fulfil the Galerkin condition for the Laplacian. 
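The β recursion for the coarser grids can be evaluated directly. It is Heron's (Newton's) iteration for $\sqrt{1}$, so β decreases from 3 on the finest grid and approaches 1 quickly on coarser grids. This sketch only evaluates the recursion. The helper name `beta_sequence` is hypothetical.

```python
def beta_sequence(beta_finest=3.0, n_levels=5):
    """[beta_m, beta_{m-1}, ...] with beta_l = (beta_{l+1} + 1/beta_{l+1}) / 2."""
    betas = [beta_finest]
    for _ in range(n_levels - 1):
        b = betas[-1]
        betas.append(0.5 * (b + 1.0 / b))
    return betas

print(beta_sequence())   # approx. [3.0, 1.667, 1.133, 1.008, 1.00003]
```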

Anisotropic artificial viscosity may also be used, but generally requires (parallel) zebra line 
relaxation, which has not yet been fully implemented. 

For FAC and AFAC, the above MG method with V(2,1) cycling is used as a global grid solver.
On the refinement levels, three relaxations are performed, and β = 3 is chosen on refinement grids.
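The following is a minimal sketch of one odd/even relaxation sweep for $L_h u = f$. It assumes that "odd/even relaxation" means red-black (checkerboard) point Gauss-Seidel on the five-point stencil of $L_h$ with central convection differences, on a square grid. It is not the authors' code. On the refinement levels, such a sweep would be applied three times. Points of one color only have neighbors of the other color, which is why a vectorized update of one color is an exact Gauss-Seidel half-sweep.

```python
import numpy as np

def stencil(h, eps, beta, a, b):
    """Five-point coefficients of L_h at (i, j): center, east, west, north, south."""
    e = max(eps, beta * h * max(abs(a), abs(b)) / 2)   # artificial viscosity eps_h
    center = 4 * e / h**2
    east = -e / h**2 + a / (2 * h)    # multiplies u[i+1, j]
    west = -e / h**2 - a / (2 * h)    # multiplies u[i-1, j]
    north = -e / h**2 + b / (2 * h)   # multiplies u[i, j+1]
    south = -e / h**2 - b / (2 * h)   # multiplies u[i, j-1]
    return center, east, west, north, south

def odd_even_sweep(u, f, h, eps, beta, a, b):
    """One red-black Gauss-Seidel sweep for L_h u = f, updating u in place.
    u and f are square arrays of equal shape; the outer ring of u holds Dirichlet values."""
    c, ce, cw, cn, cs = stencil(h, eps, beta, a, b)
    idx = np.arange(1, u.shape[0] - 1)
    I, J = np.meshgrid(idx, idx, indexing="ij")
    for color in (0, 1):
        mask = (I + J) % 2 == color
        new = (f[1:-1, 1:-1] - ce * u[2:, 1:-1] - cw * u[:-2, 1:-1]
               - cn * u[1:-1, 2:] - cs * u[1:-1, :-2]) / c
        interior = u[1:-1, 1:-1]          # a view, so the assignment updates u
        interior[mask] = new[mask]
```

Repeating the sweep on a small grid recovers a known discrete solution. This is only a check of the relaxation and says nothing about its smoothing properties inside MG or FAC.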

Convergence Results: In Table 1, several convergence factors for FAC, AFAC, and, for 
comparison, for MG are shown. The finest grids have mesh sizes of h = 1 / 64 or h = 1/512, 
respectively. For FAC and AFAC, the global grid has the mesh size h = 1/32, the (predetermined) 
fine block always covers 1/2 of the parent coarse block along the boundary layer. The following 
conclusions can be drawn: 

• For MG, the results are as expected. In the case of FAC and AFAC, the choice of β has to be
further investigated.

• V-cycles are used; W- or F-cycles would yield better convergence rates but worse parallel
efficiency.

• If $\rho(FAC)$ is small, the expected result $\rho(AFAC) \approx \sqrt{\rho(FAC)}$ can be observed; otherwise,
$\rho(FAC) \approx \rho(AFAC) < \sqrt{\rho(FAC)}$.



                 Poisson           SPP: β = 3        SPP: β = 1
h                1/64    1/512     1/64    1/512     1/64    1/512
MG-V             0.14    0.14      0.17    0.30      0.18    0.50
FAC              0.17    0.18      0.30    0.65      0.30    0.80
AFAC             0.40    0.41      0.41    0.67      0.45    0.95


Table 1: Convergence factors for a singular perturbation problem (SPP: a = b = 1, ε = 0.00001) and,
for comparison, for Poisson’s equation.
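The square-root relation can be checked against Table 1. The following sketch copies the FAC and AFAC factors from the table and prints them next to $\sqrt{\rho(FAC)}$. For Poisson's equation, ρ(AFAC) ≈ √ρ(FAC) (0.40 vs. 0.41). For the SPP with β = 3 on h = 1/512, ρ(AFAC) ≈ ρ(FAC) (0.67 vs. 0.65).

```python
import math

# (rho_FAC, rho_AFAC) pairs copied from Table 1
table1 = {
    ("Poisson", "1/64"): (0.17, 0.40),
    ("Poisson", "1/512"): (0.18, 0.41),
    ("SPP beta=3", "1/64"): (0.30, 0.41),
    ("SPP beta=3", "1/512"): (0.65, 0.67),
    ("SPP beta=1", "1/64"): (0.30, 0.45),
    ("SPP beta=1", "1/512"): (0.80, 0.95),
}

for (problem, h), (fac, afac) in table1.items():
    print(f"{problem:11s} h={h:5s} FAC={fac:.2f} AFAC={afac:.2f} sqrt(FAC)={math.sqrt(fac):.2f}")
```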


Self-Adaptive Mesh Refinement Results: More interesting for the goal of this paper are
applications of the self-adaptive process. As opposed to the convergence rates, these results depend
not only on the PDE but also on the particular solution. The results in this paper have been obtained
for the exact solution

$$u(x,y) = \frac{e^{(x-1)/\epsilon} - e^{-1/\epsilon}}{1 - e^{-1/\epsilon}} + \frac{1}{2}\, e^{-100\left(x^2 + (y-1)^2\right)},$$

which has a boundary layer for $x = 1$, $0 < y < 1$, and a steep hill around $x = 0$, $y = 1$. In order to
measure the error of the approximate solution, a discrete approximation to the $L_1$ error norm is
used. This is appropriate for this kind of problem: for solutions with discontinuities of the above
type, first-order convergence can be observed only with respect to this norm (no convergence in the
$L_\infty$ norm, order 0.5 in the $L_2$ norm).
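The test solution and the error measure can be written down directly. This sketch makes two assumptions. First, the hill term enters with a plus sign. Second, the discrete $L_1$ norm is $h^2$ times the sum of absolute values over the interior grid points. With ε = 10^-5, the exponentials $e^{(x-1)/\epsilon}$ and $e^{-1/\epsilon}$ underflow harmlessly to 0 away from x = 1, so there is no overflow.

```python
import numpy as np

def exact_solution(x, y, eps=1e-5):
    """Boundary layer at x = 1 plus a steep hill centered at (0, 1)."""
    layer = (np.exp((x - 1.0) / eps) - np.exp(-1.0 / eps)) / (1.0 - np.exp(-1.0 / eps))
    hill = 0.5 * np.exp(-100.0 * (x**2 + (y - 1.0)**2))
    return layer + hill

def discrete_l1(err, h):
    """Discrete L1 norm of interior-point errors on a uniform grid with mesh size h (assumed form)."""
    return h * h * np.sum(np.abs(err))

print(exact_solution(0.0, 1.0))   # 0.5 (top of the hill, outside the layer)
print(exact_solution(1.0, 0.5))   # about 1.0 (outflow boundary)
```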


The results have been obtained using the flagging criterion

$$h^{\gamma}\left[\beta h \max\{|a|,|b|\}\left(|D^2_{h,x} u| + |D^2_{h,y} u|\right)\right] > \eta$$

with a given value of η. For $\epsilon < \epsilon_h$, the second factor is an approximation to the lowest order error
term of the discretization. Based on experiments, $\gamma = 1$ is a good choice. Starting with the global
grid, the composite grid is self-adaptively built on the basis of the flagging and gridding algorithm
described in Section 3.
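The flagging criterion translates into a pointwise test on the current approximation. The following is a minimal sketch, assuming that $D^2_{h,x}$ and $D^2_{h,y}$ are standard second differences and that `gamma` is the exponent of h in the criterion. The function name `flag_points` is hypothetical. Its output could feed a gridding step such as the bisection sketched earlier in this paper.

```python
import numpy as np

def flag_points(u, h, beta, a, b, eta, gamma=1.0):
    """Flags interior points where
    h**gamma * beta*h*max(|a|,|b|) * (|D2x u| + |D2y u|) > eta."""
    c = u[1:-1, 1:-1]
    d2x = (u[2:, 1:-1] - 2 * c + u[:-2, 1:-1]) / h**2
    d2y = (u[1:-1, 2:] - 2 * c + u[1:-1, :-2]) / h**2
    indicator = h**gamma * beta * h * max(abs(a), abs(b)) * (np.abs(d2x) + np.abs(d2y))
    return indicator > eta

h = 1.0 / 32
x = np.linspace(0.0, 1.0, 33)
X, Y = np.meshgrid(x, x, indexing="ij")
# For u = x^2, the indicator is h * 3h * 2 = 6/1024 (about 0.0059) at every interior point.
print(flag_points(X**2, h, 3.0, 1.0, 1.0, eta=0.001).all())   # True
```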



         uniform (MG-V)      FAC, η = 0.02          FAC, η = 0.01          FAC, η = 0.001
h        e        n          e        n      b      e        n      b      e        n      b
1/32     0.0293   961        0.0293   961    1      0.0293   961    1      0.0293   961    1
1/64     0.0159   3969       0.0160   1806   4      0.0160   1967   4      0.0159   2757   3
1/128    0.0083   16129      0.0089   3430   10     0.0087   3971   10     0.0083   6212   7
1/256    0.0043   65025      0.0056   6378   19     0.0051   7943   16     0.0043   13473  12
1/512    0.0023   261121     0.0073   12306  34     0.0044   15909  30     0.0023   27410  22

Table 2: Accuracy (L1-norm e) vs. the number of grid points (n) and the number of blocks (b) for
MG-V on a uniform grid and FAC on self-adaptively refined composite grids.
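Two quick checks on Table 2 use only the tabulated numbers. First, the uniform-grid point counts are the $(1/h - 1)^2$ interior points of the unit square. Second, the observed orders $\log_2(e_h / e_{h/2})$ of the uniform-grid errors are close to 1, consistent with the first-order $L_1$ convergence expected for this solution. The script also prints the ratio of uniform points to FAC points (η = 0.001) at h = 1/512, where both reach e = 0.0023.

```python
import math

# Values copied from Table 2 (uniform grid, MG-V)
h_inv = [32, 64, 128, 256, 512]
err_uniform = [0.0293, 0.0159, 0.0083, 0.0043, 0.0023]
n_uniform = [961, 3969, 16129, 65025, 261121]
n_fac_eta_0001 = 27410   # FAC, eta = 0.001, h = 1/512

def interior_points(m):
    """Interior points of a uniform grid with mesh size 1/m on the unit square."""
    return (m - 1) ** 2

def observed_orders(errors):
    """Observed orders log2(e_h / e_{h/2}) for successive halvings of h."""
    return [math.log2(e0 / e1) for e0, e1 in zip(errors, errors[1:])]

orders = observed_orders(err_uniform)
print([round(p, 2) for p in orders])    # [0.88, 0.94, 0.95, 0.9]
print(n_uniform[-1] / n_fac_eta_0001)   # uniform points per FAC point at equal accuracy
```

The point ratio alone does not give the overall gain, because FAC and MG-V also differ in their convergence factors.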


In Table 2, the results for MG and FAC are presented for three values of η. In Figure 3, two of
the corresponding block structured grids are displayed. The corresponding error plots give an 
impression of the error distribution restricted from the composite grid to the global uniform grid. 
Thus, larger errors near the boundary layer are not visible. The results allow the following 
conclusions: 

• In spite of the well known difficulties in error control of convection dominated problems, the 
grids that are constructed self-adaptively are reasonably well suited to the numerical problem. 

• As long as the accuracy of the finest level is not reached, the error norm is approximately
proportional to η. As usual in error control by residuals, the norm of the inverse operator is
unknown, so the constant factor is not known.

• If the refinement grid does not properly match the local activity, convergence rates significantly 
degrade and the error norm may even increase. 

• Additional tests have shown that, if the boundary layer is fully resolved with an increased 
number of refinement levels, the discretization order, as expected, changes from one to two. 

• The gridding algorithm is able to treat very complicated refinement structures efficiently: the
number of blocks that are created is nearly minimal (compared to hand coding).





• Though this example needs relatively large refinement regions, the overall gain by using 
adaptive grids is more than 3.5 (taking into account the different number of points and the 
different convergence rates). For pure boundary layer problems, factors larger than 10 have 
been observed. 

• These results have been obtained in a serial environment. AMR++, however, has been 
successfully tested in parallel. For performance and efficiency considerations, see Sect. 2 and 3. 